SSIS – Best Practices

Introduction

About This Document

This document outlines best practices for developing ETL solutions in SQL Server Integration Services (SSIS). It draws on best practices from multiple sources, including technical papers, blogs and experts in the tool.

Establishing these best practices provides the following benefits:

- Reusable and flexible ETL architecture solution
- Solutions to common problems
- Tips and tricks

Audience

The primary audience for this document is developers who will be involved in the detailed design and implementation of the ETL solution. Secondary audiences are the architects who will be involved in validating the guiding principles used in the design and implementation.

Prerequisites

The audience should be familiar with SSIS.

Scope

The scope of this document covers the various best practices to be followed when developing ETL solutions in SSIS.

This document does not include:

- Introduction to ETL, Business Intelligence or Data Warehouse concepts
- Best practices for an ETL solution in general (e.g. architecture, logging, auditing etc.)
- Introduction to SSIS
- How to use a specific component of SSIS

Package Versions

This document uses:

- Microsoft SQL Server Integration Services Designer version 9.00.3033.00
- Microsoft Visual Studio 2005 Version 8.0.50727.762
- Microsoft .NET Framework Version 2.0.50727 SP1
- Installed Edition: IDE Standard Microsoft Visual Studio 2005 Premier Partner Edition - ENU, Service Pack 1 (KB926601)

Architecture and Setup

Master-Child Packages

In most cases, the problem is large enough to separate the solution into multiple packages based on a logical property or module (e.g. dimensions and facts, line of business, helper packages etc.). Having multiple packages is also efficient, as developers can work in parallel.

The master package should contain minimal to no business extraction logic. Its main purpose is to call other packages based on the dependencies of the data (dimensions before facts). The master package is also responsible for the cycle run date range, auditing, global variable values, database connection strings, error recovery etc., which it passes to the child packages.

[Figure: master package example]

Templates

Use templates to standardize package design, configuration, event handling and auditing.

An important item to remember when using a template package to create a new package is to generate a new ID for the package. A new ID can be generated from the Properties window of the package, under Identification -> ID -> Generate New ID. SSIS may not log package executions correctly if packages share the same ID.

To read more about the benefits of using templates, visit http://blogs.conchango.com/jamiethomson/archive/2007/03/11/SSIS_3A00_-Package-Template.aspx

Configurations

Package Configurations

Package configurations are a mechanism for dynamically changing properties of your Integration Services objects and components at run time, using values that are stored externally to the package.

Properties for variables (e.g. value, name, read-only, expression etc.), event handlers (e.g. error handling, system error variables etc.), connection managers (e.g. connection string, security parameters etc.), logging (e.g. events, description) and other general properties (e.g. checkpoint usage, failure recovery, version control, access etc.) are easily configurable in SSIS using the Package Configuration Wizard. At run time, SSIS applies the configurations to the package based on the method and the order created in the wizard.

SSIS provides a number of methods to store and pass these values to a package:

- SQL Server
- XML file
- Environment variable
- Registry settings
- Parent package variable

Each method should be used as described below.

SQL Server

This is a direct configuration method and the recommended way to store most of your package configuration values. SSIS stores the values in a table with a default schema, at a package, property and value grain.

The developer has flexibility and control over the configurations, as the configuration values can easily be modified prior to run time (e.g. the date range for which the package should run). The package itself has control over the configurations during run time, as it can easily modify property values (e.g. a global error count variable whenever an error is encountered by any package in the solution) with an update to the configuration table. A sketch of the table and such an update follows below.
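
As a sketch: the Package Configuration Wizard's default configuration table (named [SSIS Configurations] unless you override it) has the shape below, and a pre-run change such as moving the extract date range is a plain UPDATE. The filter, variable name and date value are illustrative, not from any particular package.

    -- Default table created by the Package Configuration Wizard
    -- (the table name and schema shown are the wizard's defaults).
    CREATE TABLE dbo.[SSIS Configurations] (
        ConfigurationFilter NVARCHAR(255) NOT NULL, -- groups related entries
        ConfiguredValue     NVARCHAR(255) NULL,     -- value applied at run time
        PackagePath         NVARCHAR(255) NOT NULL, -- property path in the package
        ConfiguredValueType NVARCHAR(255) NOT NULL  -- SSIS data type of the value
    );

    -- Illustrative: move the extract window forward before the next run.
    UPDATE dbo.[SSIS Configurations]
    SET    ConfiguredValue = '2009-01-01'
    WHERE  ConfigurationFilter = 'DailyLoad'
    AND    PackagePath = '\Package.Variables[User::ExtractStartDate].Properties[Value]';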

Use a variable to store the connection string of the configuration table's database for a connection manager. You then only need to update this variable value, using an indirect configuration method, to specify the table connection properties.

By storing values in a SQL Server table, we have a central repository to review and modify variable values and package properties without the tedious task of opening the package, modifying the value or property within the package and re-deploying it.

XML File

This is also an effective direct method for storing configurations, but it is more complicated to modify configurations during run time. Each package of a solution can have its configurations in one file, providing a central location to change package properties.

One can also use the XML configuration file as an indirect configuration (storing the direct configuration table's connection value) by setting the file location as a static configuration entry in the Package Configuration Wizard. When deploying the solution to different environments, you must deploy the configuration file in the same directory structure in each environment.

Environment Variable

Leverage indirect configuration by using an environment variable to store the XML configuration file location or the SQL Server connection of the configuration table. This allows you to have, and easily switch between, multiple configuration sets by changing the environment variable value rather than modifying it at the package level and re-building the package. Also, when deploying solutions to different environments, only the environment variable needs to be changed to identify the new location or database connection for the direct configuration.

Registry Settings

Similar usage to an environment variable, except the direct configuration value is set in a registry key.

Parent Package Variable

Use it to pass global variable values from the parent package to the child package, such as database connection strings, the cycle run ID, audit data and other useful information. Ensure that you create a variable with the same data type in the child package, or data value compatibility errors may arise.

Configurable Packages

As described above, SSIS allows multiple ways to dynamically configure packages. This is a powerful feature that can be used in many ways to make a package extensible and flexible. Store most variable values, and the expression property of variables, in the SQL Server configuration table. Additionally, store the source queries used to extract data in a variable and leverage expressions to dynamically generate a query for a set of criteria, as sketched below. An example is provided in the DynamicQuery package of the SSIS_Best_Practices solution.
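
A minimal sketch of the expression approach (the variable, table and column names are illustrative, not taken from the DynamicQuery package): define a string variable with EvaluateAsExpression=TRUE and build the query from other variables.

    Expression on User::SourceQuery (SSIS expression syntax):
    "SELECT OrderID, CustomerID, Amount FROM dbo.Orders WHERE OrderDate >= '"
      + (DT_WSTR, 30) @[User::RangeStart]
      + "' AND OrderDate < '" + (DT_WSTR, 30) @[User::RangeEnd] + "'"

An OLE DB Source can then read this variable via the "SQL command from variable" access mode, so each run extracts only the configured date range.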

Recommended Reading

SSIS Package Configuration in SQL Server 2005

http://www.sql-server-performance.com/articles/dba/package_configuration_2005_p1.aspx

Connection Managers

For an overview of the plethora of connection managers available in SSIS, review the MS Connection Manager white paper.

Native Providers

Experience tells us to use native providers. However, in the world of SSIS, native providers are usually not the most efficient option. The white paper 'SSIS 2008 Connectivity Options' highlights some of the third-party options available. You may need to install additional client connectors to connect effectively to a database from SSIS (see the frequent error for Oracle DB connections described below).

Oracle

Oracle's own provider for an Oracle DB is quite slow and inefficient for extracting data. The better option is using the provider from SSIS with a few performance tweaks to the registry of the client that will host the packages. In the registry under HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\OLEDB, change the values of the following keys: FetchSize to '200', CacheType to 'File' and PLSQLRSet to '1'.

For writing to an Oracle DB, the provider from Persistent has great reviews and is reportedly up to 100 times faster than other providers. Visit http://www.persistentsys.com/ServiceOfferings/TimeToMarketAccelerators/Connectors/SSISOracleBulkLoadConnector/tabid/195/Default.aspx for more information.

Microsoft has released new Oracle providers (included in SSIS 2008) which are reportedly much faster and more efficient than their previous counterparts. You can download them here: http://www.microsoft.com/downloads/details.aspx?FamilyID=d9cb21fe-32e9-4d34-a381-6f9231d84f1e&DisplayLang=en

A frequent error received while connecting to an Oracle DB from SSIS is “client and networking components were not found”. The fix is to install the Oracle client on the machine you are running SSIS from. This installs the correct libraries and drivers needed to connect to an Oracle source. Visit http://www.oracle.com/technology/software/tech/oci/instantclient/index.html to install the correct client.

Recommended Reading

Connectivity and SQL Server 2005 Integration Services

http://technet.microsoft.com/en-us/library/bb332055.aspx

How to Choose the Best Connectors for SSIS

http://www.sqlservercentral.com/articles/SQL+Server+2005+-+SSIS/3209/

SSIS Data Sources Categories and Compatibility (highly recommended)

http://ssis.wik.is/Data_Sources

Data Flow Task

The Data Flow Task is the most useful and important task. It facilitates the movement of data from sources to destinations, allowing complex transformations to take place within the flow. Eliminating unnecessary work done by the Data Flow Task is the single most important way to improve its performance.

Using the highly recommended reference http://www.simple-talk.com/sql/sql-server-2005/sql-server-2005-ssis-tuning-the-dataflow-task/, I have categorized the performance improvement tips below. Note that I have added or removed some tips based on experience.

General

- Increase DefaultBufferMaxSize and DefaultBufferMaxRows - increasing the values of these two properties can boost performance by decreasing the number of buffers moving through the data flow. However, avoid increasing the values to the point where the Execution Engine starts swapping buffers out to disk; that would defeat the purpose.

- Use match indexes for repeat data cleansing sessions - when the package runs again, the Fuzzy Lookup transformation can either use an existing match index or create a new one. If the reference table is static, the package can avoid the potentially expensive process of rebuilding the index for repeat sessions of data cleansing. If you choose to use an existing index, the index is created the first time the package runs. If multiple Fuzzy Lookup transformations use the same reference table, they can all use the same index. To reuse the index, the lookup operations must be identical; the lookup must use the same columns. You can name the index and select the connection to the SQL Server database that saves the index.

- Implement parallel execution - both the Execution Engine for the Data Flow Task and the Execution Engine for the Control Flow are multithreaded.
  o Use the EngineThreads property - it controls the number of worker threads the Execution Engine will use. The default for this property is 5; however, by simply adding a few components, the data flow's thread requirements will quickly exceed the default. Be aware of how many threads the data flow naturally needs and try to keep the EngineThreads value reasonably close to it.
  o Set MaxConcurrentExecutables - if there are multiple Data Flow Tasks in the Control Flow, say 10, and MaxConcurrentExecutables is set to 4, only four of the Data Flow Tasks will execute simultaneously. Set, test and measure various value combinations of this property and the EngineThreads property to determine the optimal setting for your packages.

Data Flow Sources

- Other online links urge you to use the DataReader Source as opposed to the OLE DB Source for performance improvements. However, experience suggests that no credible improvement is seen, and one should use the OLE DB Source: the flexibility it provides to move between environments and servers outweighs the negligible performance degradation.

- Remove unnecessary columns - unneeded columns are columns that never get referenced in the data flow. The Execution Engine emits warnings for unused columns, so they are easy to identify. Removing them makes the buffer rows narrower; the narrower the row, the more rows fit into one buffer and the more efficiently the rows can be processed.
  o NOTE: buffers are objects with associated chunks of memory that contain the data to be transformed. As data flows through the Data Flow Task, it lives in a buffer from the time the source adapter reads it until the time the destination adapter writes it. Binary Large Objects (BLOBs) are particularly burdensome to the Data Flow Task and should be eliminated if at all possible. Use the queries in the source adapters to eliminate unnecessary columns.

- Use a SQL SELECT statement to retrieve data from a view - avoid using the “Table or view” access mode in the OLE DB Source Adapter. It is not as fast as using a SELECT statement, because the adapter opens a rowset based on the table or view, then calls OpenRowset in the validation phase to retrieve column metadata and later in the execution phase to read out the data (see the sketch after this list).
  o Testing has shown that using a SELECT statement can be at least an order of magnitude faster, because the adapter issues the specified command directly through the provider and fetches the data using sp_prepare without executing the command, avoiding the extra round trip and a possibly inappropriate cached query plan.

- Optimize source queries - using traditional query optimization techniques, optimize the source adapter's SQL query. SSIS doesn't optimize the query on your behalf, but passes it on verbatim.

- Perform set-based operations in SQL Server when possible - for example, SQL Server can generally sort faster than the Sort transform, especially if the table being sorted is indexed. Set-based operations such as joins, unions, and SELECTs with ORDER BY and GROUP BY tend to be faster on the server.

- If you know that the data in a source is sorted, set IsSorted=TRUE on the source adapter output (sketched after this list). This may save unnecessary Sort transforms later in the pipeline, which can be expensive. Setting this value does not perform a sort operation; it only indicates that the data is sorted.

- If you need a dynamic SQL statement in an OLE DB Source component, set AccessMode="SQL Command from variable" and build the SQL statement in a variable that has EvaluateAsExpression=TRUE.

- If you can, filter your data in the source adapter rather than with a Conditional Split transform component. This will make your data flow perform quicker.

- Break out all tasks requiring the Jet engine (Excel or Access data sources) into their own packages that do nothing but that data flow task, loading the data into staging tables if necessary. This ensures that solutions can be migrated to 64-bit with no rework, as there is no 64-bit Jet driver.
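
Two sketches for the source tips above (the table, view and column names are illustrative):

    -- Preferred over the 'Table or view' access mode (which behaves
    -- like SELECT * via OpenRowset): an explicit SELECT returning
    -- only the needed columns and rows.
    SELECT CustomerID, OrderDate, Amount
    FROM   dbo.vw_Orders
    WHERE  OrderDate >= '2009-01-01';

    -- Sorting server-side so the pipeline can rely on sorted input:
    SELECT CustomerID, OrderDate, Amount
    FROM   dbo.Orders
    ORDER  BY CustomerID;

For the second query, also open the source adapter's Advanced Editor and set IsSorted=TRUE on the output and SortKeyPosition=1 on the CustomerID output column; SSIS does not infer either setting from the query.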

Flat File and Other File Sources

Retrieving data from file sources presents its own set of performance challenges, because the data is typically in some format that requires conversion. For example, the Jet provider only supports a limited set of data types when reading Excel files, and flat file data is always of type string until converted. Here are a few hints on how to eliminate unnecessary data flow work:

- Combine adjacent unneeded flat file columns - to eliminate unnecessary parsing.
- Leave unneeded flat file columns as strings - don't convert them to dates etc. unless absolutely necessary.

- Eliminate hidden operations - mostly, the Data Flow Task is explicit about what it is doing. However, there are some components that perform hidden or automatic conversions. For example, the Flat File Source Adapter will attempt to convert external column types to their associated output column types. Use the Advanced Editor to explore each column type so that you know where such conversions occur.

- Only parse or convert columns when necessary - reorganize the data flow to eliminate the Data Conversion transform if possible. Even better, if possible, modify the source column data type to match the type needed in the data flow.

- Use the FastParse option in the Flat File Source - FastParse is a set of optimized parsing routines that replaces some of the locale-specific SSIS parsing functions.

- Eliminate unneeded logging - logging is useful for debugging and troubleshooting but, when deploying completed packages to production, be mindful and careful about the log entries you leave enabled and the log provider you use. Notably, OnPipelineRowsSent is somewhat verbose.

- When using raw files and your Raw File Source and Raw File Destination components are in the same package, configure both to get the name of the raw file from a variable. This avoids hardcoding the name of the raw file into the two separate components and running the risk that one may change and not the other.

- Variables that contain the name of a raw file should be set using an expression (sketched after this list). This allows you to include the value returned by System::PackageName in the raw file name, so you can easily identify which package a raw file is to be used by.
  o Note: this approach will only work if the Raw File Source and Raw File Destination components are in the same package.
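
A minimal sketch of such an expression (the folder variable is illustrative):

    Expression on User::RawFileName (SSIS expression syntax,
    EvaluateAsExpression = TRUE):
    @[User::RawFileFolder] + "\\" + @[System::PackageName] + ".raw"

Both the Raw File Source and the Raw File Destination then reference User::RawFileName, so the two components can never drift apart.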

Data Flow Transformations

- Be mindful of transforms with internal file IO - some of the stock data flow transforms perform internal file input/output. For example, the Raw File Source and Destination, the Import/Export Column transforms, the Flat File Source and Destination and the Excel Source and Destination adapters are all directly impacted by the performance of the file system. File IO isn't always a bottleneck, but when combined with low memory conditions (causing spooling) or with other disk-intensive operations, it can significantly impact performance. Components that read and write to disk should be scrutinized carefully and, if possible, configured to read and write to dedicated hard drives. Look for ways to optimize the performance of the hard drives using RAID, defragmentation and/or correct partitioning.

- Monitor memory-intensive transforms - if your package is memory bound, look for ways to eliminate the memory-intensive transforms or shift them to another package. Some transforms, such as the Aggregate, Lookup and Sort, use a lot of memory. The Sort, for example, holds all buffers until the last buffer and then releases the sorted rows. If memory runs low, these transforms may spool to disk, causing expensive hard page faults.
  o Monitor other memory-intensive applications - when running on the same machine as other memory-intensive applications, the data flow can become memory starved, even if there is plenty of memory on the machine. This is typically true when running packages on the same machine as SQL Server, which is aggressive about using memory. You can use the sp_configure system stored procedure to instruct SQL Server to limit its memory usage, as sketched after this list.
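
For example, to cap SQL Server's memory usage (the 4 GB figure is illustrative; size it for your machine):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    -- Limit SQL Server so the SSIS data flow keeps some memory.
    EXEC sp_configure 'max server memory (MB)', 4096;
    RECONFIGURE;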

Lookup Transformation

- Pare down the Lookup reference data - the default lookup query for the Lookup transform is SELECT * FROM <reference table>. Instead, select the option to use the results of a query for the reference data. Generally, the reference data should contain only the key and the desired lookup column; for a dimension table lookup, that would be the natural key and the surrogate key (see the sketch after this list).

- Use the Lookup's partial or full cache mode - depending on the requirements and the data, choose one of these two modes to speed up the Lookup. Partial cache mode is useful when the incoming data is repetitive and references only a small percentage of the total reference table. Full cache mode is useful when the reference table is relatively small and the incoming data references the full spectrum of reference table rows.

- Lookup components will generally work quicker than Merge Join components where the two can be used for the same task (case study: http://blogs.conchango.com/jamiethomson/archive/2005/10/21/2289.aspx).
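
A sketch of a pared-down Lookup reference query for a dimension lookup (illustrative names):

    -- Only the natural key (for matching) and the surrogate key
    -- (the value we want back), instead of SELECT *:
    SELECT CustomerAltKey, CustomerKey
    FROM   dbo.DimCustomer;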

Slowly Changing Dimension (SCD) Transformation

It is interesting that SSIS provides an out-of-the-box solution for SCDs; however, for most practical purposes its performance is extremely poor.

Some other cons of this transform are:
  o It can be slow when applied to large dimensions, i.e. anything over 1,000 rows.
  o You can't configure the implied comparison/lookup between the two tables; it is strictly a .NET comparison for equality. That means, for example, that a varchar(10) may not compare correctly to a char(10), even though to the human eye they are identical.
  o If any part of the comparison results in a NULL, the transformation throws an exception, which may or may not cause the entire package to abort, depending on how you have “programmed” the package.
  o It can't handle any “special cases”, like updating Oracle tables.

An alternative approach to using SSIS's SCD transformation is to consider an open source solution such as the Kimball Method SSIS SCD Component (http://www.codeplex.com/kimballscd).

Another alternative is to use existing SSIS components and build a solution for the SCD types, as shown below:

[Figures: SCD solution built from existing SSIS components]
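
For readers without the figures, here is a set-based sketch of the Type 2 pattern such a solution implements (the staging and dimension schemas are assumed for illustration, not taken from the original figures):

    -- 1. Expire the current dimension row when a tracked attribute changed.
    UPDATE d
    SET    d.RowEndDate = GETDATE(),
           d.IsCurrent  = 0
    FROM   dbo.DimCustomer AS d
    JOIN   stg.Customer    AS s ON s.CustomerAltKey = d.CustomerAltKey
    WHERE  d.IsCurrent = 1
    AND    (d.City <> s.City OR d.Segment <> s.Segment);

    -- 2. Insert a new current row for new and changed customers.
    INSERT INTO dbo.DimCustomer
           (CustomerAltKey, City, Segment, RowStartDate, RowEndDate, IsCurrent)
    SELECT s.CustomerAltKey, s.City, s.Segment, GETDATE(), NULL, 1
    FROM   stg.Customer AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.DimCustomer AS d
                       WHERE  d.CustomerAltKey = s.CustomerAltKey
                       AND    d.IsCurrent = 1);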

Data Flow Destination

- Other online links urge you to use the SQL Server Destination as opposed to the OLE DB Destination for performance improvements. However, experience suggests that no credible improvement is seen, and one should use the OLE DB Destination: the flexibility it provides to move between environments and servers outweighs the negligible performance degradation.

- Set the commit size - the Commit Size option allows you to set a larger buffer commit size for loading into SQL Server. This setting is only available in the OLE DB Destination when using the SQL Server OLE DB provider. A setting of zero indicates that the adapter should attempt to commit all rows in a single batch.

- Turn on Table Lock - this option is available in the OLE DB Destination Editor. Selecting “Table lock” also enables fast load, which tells the adapter to use the IRowsetFastLoad bulk insert interface for loading.
  o NOTE: fast load delivers much better performance; however, it does not provide as much information if there is an error. Generally, turn it off during development and turn it on when deploying to production.

- Disable constraints - this option is also available in the OLE DB Destination Editor, by unchecking the “Check constraints” option.

Trash Destination

During development, it is quite useful to have a destination that consumes rows but does not otherwise do anything with them. Konesans has developed an adapter that does exactly that (along with many other useful adapters). Using it with a data viewer allows you to monitor rows that could potentially have been rejected by the previous component in the data flow. You can download the adapter here: http://www.sqlis.com/post/Trash-Destination-Adapter.aspx

References

Links

http://www.sqlis.com/post/Easy-Package-Configuration.aspx

http://www.sqlis.com/post/Trash-Destination-Adapter.aspx

http://www.simple-talk.com/sql/sql-server-2005/sql-server-2005-ssis-tuning-the-dataflow-task/

http://www.whitworth.org/2009/04/14/todd-mcdermids-blog-an-alternative-to-ssiss-scd-wizard/