Page 1: Incremental Loading

Andy Leonard

SSIS and ETLThoughts about Database and Software Development, and the tools of the trade.

SSIS Design Pattern - Incremental Loads

Introduction

Loading data from a data source to SQL Server is a common task. It's used in Data Warehousing, but increasingly data is being staged in SQL Server for non-Business-Intelligence purposes.

Maintaining data integrity is key when loading data into any database. A common way of accomplishing this is to truncate the destination and reload from the source. While this method ensures data integrity, it also reloads a lot of data that was just deleted.

Incremental loads are faster and use fewer server resources. Only new or updated data is touched in an incremental load.

When To Use Incremental Loads

Use incremental loads whenever you need to load data from a data source to SQL Server.

Incremental loads are the same regardless of which database platform or ETL tool you use. You need to detect new and updated rows - and separate these from the unchanged rows.

Incremental Loads in Transact-SQL

I will start by demonstrating this with T-SQL:

0. (Optional, but recommended) Create two databases: a source and destination database for this demonstration:

CREATE DATABASE [SSISIncrementalLoad_Source]

CREATE DATABASE [SSISIncrementalLoad_Dest]

1. Create a source table named tblSource with the columns ColID, ColA, ColB, and ColC:

USE SSISIncrementalLoad_Source
GO
CREATE TABLE dbo.tblSource


(ColID int NOT NULL
,ColA varchar(10) NULL
,ColB datetime NULL constraint df_ColB default (getDate())
,ColC int NULL
,constraint PK_tblSource primary key clustered (ColID))

2. Create a Destination table named tblDest with the columns ColID, ColA, ColB, ColC:

USE SSISIncrementalLoad_Dest
GO
CREATE TABLE dbo.tblDest
(ColID int NOT NULL
,ColA varchar(10) NULL
,ColB datetime NULL
,ColC int NULL)

3. Let's load some test data into both tables for demonstration purposes:

USE SSISIncrementalLoad_Source
GO

-- insert an "unchanged" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(1, 'B', '1/1/2007 12:02 AM', -2)

-- insert a "new" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(2, 'N', '1/1/2007 12:03 AM', -3)

USE SSISIncrementalLoad_Dest
GO

-- insert an "unchanged" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(1, 'C', '1/1/2007 12:02 AM', -2)

4. You can view new rows with the following query:

SELECT s.ColID, s.ColA, s.ColB, s.ColC
FROM SSISIncrementalLoad_Source.dbo.tblSource s
LEFT JOIN SSISIncrementalLoad_Dest.dbo.tblDest d ON d.ColID = s.ColID


WHERE d.ColID IS NULL

This should return the "new" row - the one loaded earlier with ColID = 2 and ColA = 'N'. Why? The LEFT JOIN and WHERE clauses are the key. Left Joins return all rows on the left side of the join clause (SSISIncrementalLoad_Source.dbo.tblSource in this case) whether there's a match on the right side of the join clause (SSISIncrementalLoad_Dest.dbo.tblDest in this case) or not. If there is no match on the right side, NULLs are returned. This is why the WHERE clause works: it goes after rows where the destination ColID is NULL. These rows have no match in the LEFT JOIN, therefore they must be new.

This is only an example. You occasionally find database schemas that are this easy to load. Occasionally. Most of the time you have to include several columns in the JOIN ON clause to isolate truly new rows. Sometimes you have to add conditions in the WHERE clause to refine the definition of truly new rows.
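As a sketch of the multi-column case, assume two hypothetical tables whose rows are identified by the pair (RegionID, OrderID) rather than a single ColID. The table variables and column names below are made up for illustration and are not part of the demo databases; they only show the shape of the JOIN ON clause:

```sql
-- Hypothetical data keyed on (RegionID, OrderID); not part of the demo schema.
DECLARE @Source TABLE (RegionID int NOT NULL, OrderID int NOT NULL, Amount int NULL);
DECLARE @Dest   TABLE (RegionID int NOT NULL, OrderID int NOT NULL, Amount int NULL);

INSERT INTO @Source (RegionID, OrderID, Amount) VALUES (1, 100, 10);
INSERT INTO @Source (RegionID, OrderID, Amount) VALUES (2, 100, 20);
INSERT INTO @Dest   (RegionID, OrderID, Amount) VALUES (1, 100, 10);

-- Both key columns appear in the JOIN ON clause. Joining on OrderID alone
-- would match (2, 100) to (1, 100) and treat it as already loaded.
SELECT s.RegionID, s.OrderID, s.Amount
FROM @Source s
LEFT JOIN @Dest d
  ON d.RegionID = s.RegionID
 AND d.OrderID = s.OrderID
WHERE d.OrderID IS NULL;
```

Only the row for RegionID 2 has no destination match on the full key, so it is the one treated as new.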

Incrementally load the row ("rows" in practice) with the following T-SQL statement:

INSERT INTO SSISIncrementalLoad_Dest.dbo.tblDest
(ColID, ColA, ColB, ColC)
SELECT s.ColID, s.ColA, s.ColB, s.ColC
FROM SSISIncrementalLoad_Source.dbo.tblSource s
LEFT JOIN SSISIncrementalLoad_Dest.dbo.tblDest d ON d.ColID = s.ColID
WHERE d.ColID IS NULL

5. There are many ways by which people try to isolate changed rows. The only sure-fire way to accomplish it is to compare each field. View changed rows with the following T-SQL statement:

SELECT d.ColID, d.ColA, d.ColB, d.ColC
FROM SSISIncrementalLoad_Dest.dbo.tblDest d
INNER JOIN SSISIncrementalLoad_Source.dbo.tblSource s ON s.ColID = d.ColID
WHERE ((d.ColA != s.ColA)
OR (d.ColB != s.ColB)
OR (d.ColC != s.ColC))

This should return the "changed" row we loaded earlier with ColID = 1 and ColA = 'C'. Why? The INNER JOIN and WHERE clauses are to blame - again. The INNER JOIN goes after rows with matching ColIDs because of the JOIN ON clause. The WHERE clause refines the resultset, returning only rows where the ColAs, ColBs, or ColCs don't match. This is important. If there's a difference in any, some, or all of the columns (except ColID), we want to update the row.
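One caveat with the != comparisons: in T-SQL, comparing anything to NULL yields UNKNOWN rather than true, so a row whose ColA went from NULL to a value (or the reverse) would not be returned by a WHERE clause built this way. The sketch below shows one possible NULL-aware variant using EXISTS with EXCEPT, which treats two NULLs as equal. It uses table variables standing in for tblSource and tblDest so it can be run in any database, and it is an illustration rather than part of the original demonstration:

```sql
DECLARE @Source TABLE (ColID int NOT NULL, ColA varchar(10) NULL, ColB datetime NULL, ColC int NULL);
DECLARE @Dest   TABLE (ColID int NOT NULL, ColA varchar(10) NULL, ColB datetime NULL, ColC int NULL);

-- ColA is NULL in the destination but 'B' in the source.
INSERT INTO @Source (ColID, ColA, ColB, ColC) VALUES (1, 'B', '20070101 00:02', -2);
INSERT INTO @Dest   (ColID, ColA, ColB, ColC) VALUES (1, NULL, '20070101 00:02', -2);

-- Plain != misses this row: NULL != 'B' evaluates to UNKNOWN.
SELECT d.ColID
FROM @Dest d
INNER JOIN @Source s ON s.ColID = d.ColID
WHERE (d.ColA != s.ColA) OR (d.ColB != s.ColB) OR (d.ColC != s.ColC);

-- EXCEPT compares NULLs as equal, so the row is reported as changed.
SELECT d.ColID
FROM @Dest d
INNER JOIN @Source s ON s.ColID = d.ColID
WHERE EXISTS (SELECT s.ColA, s.ColB, s.ColC
              EXCEPT
              SELECT d.ColA, d.ColB, d.ColC);
```

The first query returns nothing for this data, while the second returns ColID 1. Whether NULL-to-value transitions should count as changes is a design decision for each load.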

Extract-Transform-Load (ETL) theory has a lot to say about when and how to update changed data. You will want to pick up a good book on the topic to learn more about the variations.

To update the data in our destination, use the following T-SQL: 

UPDATE d
SET
d.ColA = s.ColA
,d.ColB = s.ColB
,d.ColC = s.ColC
FROM SSISIncrementalLoad_Dest.dbo.tblDest d
INNER JOIN SSISIncrementalLoad_Source.dbo.tblSource s ON s.ColID = d.ColID
WHERE ((d.ColA != s.ColA)


OR (d.ColB != s.ColB)
OR (d.ColC != s.ColC))

Incremental Loads in SSIS

Let's take a look at how you can accomplish this in SSIS using the Lookup Transformation (for the join functionality) combined with the Conditional Split (for the WHERE clause conditions) transformations.

Before we begin, let's reset our database tables to their original state using the following query:

USE SSISIncrementalLoad_Source
GO

TRUNCATE TABLE dbo.tblSource

-- insert an "unchanged" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(1, 'B', '1/1/2007 12:02 AM', -2)

-- insert a "new" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(2, 'N', '1/1/2007 12:03 AM', -3)

USE SSISIncrementalLoad_Dest
GO

TRUNCATE TABLE dbo.tblDest

-- insert an "unchanged" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(1, 'C', '1/1/2007 12:02 AM', -2)

Next, create a new project using Business Intelligence Development Studio (BIDS). Name the project SSISIncrementalLoad:


Once the project loads, open Solution Explorer and rename Package1.dtsx to SSISIncrementalLoad.dtsx:


When prompted to rename the package object, click the Yes button. From the toolbox, drag a Data Flow onto the Control Flow canvas:


Double-click the Data Flow task to edit it. From the toolbox, drag and drop an OLE DB Source onto the Data Flow canvas: 


Double-click the OLE DB Source connection adapter to edit it:


Click the New button beside the OLE DB Connection Manager dropdown:


Click the New button here to create a new Data Connection:


Enter or select your server name. Connect to the SSISIncrementalLoad_Source database you created earlier. Click the OK button to return to the Connection Manager configuration dialog. Click the OK button to accept your newly created Data Connection as the Connection Manager you wish to define. Select "dbo.tblSource" from the Table dropdown:


Click the OK button to complete defining the OLE DB Source Adapter.

Drag and drop a Lookup Transformation from the toolbox onto the Data Flow canvas. Connect the OLE DB Source adapter to the Lookup transformation by clicking on the OLE DB Source and dragging the green arrow over the Lookup and dropping it. Right-click the Lookup transformation and click Edit (or double-click the Lookup transformation) to edit:


When the editor opens, click the New button beside the OLE DB Connection Manager dropdown (as you did earlier for the OLE DB Source Adapter). Define a new Data Connection - this time to the SSISIncrementalLoad_Dest database. After setting up the new Data Connection and Connection Manager, configure the Lookup transformation to connect to "dbo.tblDest":
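As an optional variation (an assumption, not a step in the original walkthrough): the Lookup editor can also take the results of a SQL query as its reference set instead of a table name. Listing only the columns the comparison needs keeps the cached reference data narrower when the destination table is wide. For this demo, where every column is compared, the query would simply be:

```sql
-- Reference query for the Lookup; lists only the columns used downstream.
SELECT ColID, ColA, ColB, ColC
FROM dbo.tblDest
```

With the four-column demo table this makes no practical difference; it matters more when the destination has many columns that never take part in the comparison.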


Click the Columns tab. On the left side are the columns currently in the SSIS data flow pipeline (from SSISIncrementalLoad_Source.dbo.tblSource). On the right side are columns available from the Lookup destination you just configured (from SSISIncrementalLoad_Dest.dbo.tblDest). Follow these steps:

1. We'll need all the columns returned from the destination table, so check all the checkboxes beside the columns in the destination. We need these columns for our WHERE clauses and for our JOIN ON clauses.

2. We do not want to map all the columns between the source and destination - we only want to map the columns named ColID between the database tables. The Mappings drawn between the Available Input Columns and Available Lookup Columns define the JOIN ON clause. Multi-select the Mappings between ColA, ColB, and ColC by clicking on them while holding the Ctrl key. Right-click any of them and click "Delete Selected Mappings" to delete these columns from our JOIN ON clause.

3. Add the text "Dest_" to each column's Output Alias. These columns are being appended to the data flow pipeline. The prefix lets us distinguish between Source and Destination columns farther down the pipeline:


Next we need to modify our Lookup transformation behavior. By default, the Lookup operates as an INNER JOIN - but we need a LEFT (OUTER) JOIN. Click the "Configure Error Output" button to open the "Configure Error Output" screen. On the "Lookup Output" row, change the Error column from "Fail component" to "Ignore failure". This tells the Lookup transformation "If you don't find an INNER JOIN match in the destination table for the Source table's ColID value, don't fail." - which also effectively tells the Lookup "Don't act like an INNER JOIN, behave like a LEFT JOIN":
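In T-SQL terms, the Lookup configured this way produces roughly the following result set: every source row, plus the destination columns under their Dest_ aliases, with NULLs where no match was found. The query is only an analogy for what flows down the pipeline and assumes the demo databases created earlier:

```sql
-- Analogy only: the shape of the rows leaving the Lookup.
SELECT s.ColID, s.ColA, s.ColB, s.ColC,
       d.ColID AS Dest_ColID,
       d.ColA  AS Dest_ColA,
       d.ColB  AS Dest_ColB,
       d.ColC  AS Dest_ColC
FROM SSISIncrementalLoad_Source.dbo.tblSource s
LEFT JOIN SSISIncrementalLoad_Dest.dbo.tblDest d
  ON d.ColID = s.ColID
```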


Click OK to complete the Lookup transformation configuration.

From the toolbox, drag and drop a Conditional Split Transformation onto the Data Flow canvas. Connect the Lookup to the Conditional Split as shown. Right-click the Conditional Split and click Edit to open the Conditional Split Editor:


Expand the NULL Functions folder in the upper right of the Conditional Split Transformation Editor. Expand the Columns folder in the upper left side of the Conditional Split Transformation Editor. Click in the "Output Name" column and enter "New Rows" as the name of the first output. From the NULL Functions folder, drag and drop the "ISNULL( <<expression>> )" function to the Condition column of the New Rows condition:


Next, drag Dest_ColID from the columns folder and drop it onto the "<<expression>>" text in the Condition column. "New Rows" should now be defined by the condition "ISNULL( [Dest_ColID] )". This defines our WHERE clause for detecting new rows - setting it to "WHERE Dest_ColID Is NULL".

Type "Changed Rows" into a second Output Name column. Add the expression "(ColA != Dest_ColA) || (ColB != Dest_ColB) || (ColC != Dest_ColC)" to the Condition column for the Changed Rows output. This defines our WHERE clause for detecting changed rows - setting it to "WHERE ((Dest_ColA != ColA) OR (Dest_ColB != ColB) OR (Dest_ColC != ColC))". Note "||" is used to convey "OR" in SSIS Expressions:

Change the "Default output name" from "Conditional Split Default Output" to "Unchanged Rows":
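The three outputs can be pictured as a CASE expression over the Lookup's result, evaluated top to bottom so that a row lands in the first output whose condition it meets. The query below is an analogy in T-SQL against the demo databases, not something the package runs:

```sql
-- Analogy only: which Conditional Split output each source row would take.
SELECT s.ColID,
       CASE
         WHEN d.ColID IS NULL THEN 'New Rows'
         WHEN (s.ColA != d.ColA) OR (s.ColB != d.ColB) OR (s.ColC != d.ColC) THEN 'Changed Rows'
         ELSE 'Unchanged Rows'
       END AS SplitOutput
FROM SSISIncrementalLoad_Source.dbo.tblSource s
LEFT JOIN SSISIncrementalLoad_Dest.dbo.tblDest d
  ON d.ColID = s.ColID
```

With the test data loaded earlier, ColID 0 falls through to Unchanged Rows, ColID 1 is a Changed Row, and ColID 2 is a New Row.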


Click the OK button to complete configuration of the Conditional Split transformation.

Drag and drop an OLE DB Destination connection adapter and an OLE DB Command transformation onto the Data Flow canvas. Click on the Conditional Split and connect it to the OLE DB Destination. A dialog will display prompting you to select a Conditional Split Output (those outputs you defined in the last step). Select the New Rows output:


Next connect the OLE DB Command transformation to the Conditional Split's "Changed Rows" output:

 Your Data Flow canvas should appear similar to the following:


Configure the OLE DB Destination by aiming at the SSISIncrementalLoad_Dest.dbo.tblDest table:


Click the Mappings item in the list to the left. Make sure the ColID, ColA, ColB, and ColC source columns are mapped to their matching destination columns (aren't you glad we prepended "Dest_" to the destination columns?):


Click the OK button to complete configuring the OLE DB Destination connection adapter.

Double-click the OLE DB Command to open the "Advanced Editor for OLE DB Command" dialog. Set the Connection Manager column to your SSISIncrementalLoad_Dest connection manager:


Click on the "Component Properties" tab. Click the ellipsis (button with "...").

The String Value Editor displays. Enter the following parameterized T-SQL statement into the String Value textbox:

UPDATE dbo.tblDest
SET
ColA = ?
,ColB = ?
,ColC = ?
WHERE ColID = ?

The question marks in the previous parameterized T-SQL statement map by ordinal to columns named "Param_0" through "Param_3": Param_0 through Param_2 supply ColA, ColB, and ColC, and Param_3 supplies ColID. Map them as shown below - effectively altering the UPDATE statement for each row to read:

UPDATE SSISIncrementalLoad_Dest.dbo.tblDest
SET
ColA = SSISIncrementalLoad_Source.dbo.ColA
,ColB = SSISIncrementalLoad_Source.dbo.ColB
,ColC = SSISIncrementalLoad_Source.dbo.ColC
WHERE ColID = SSISIncrementalLoad_Source.dbo.ColID

Note the query is executed on a row-by-row basis. For performance with large amounts of data, you will want to employ set-based updates instead.
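A minimal sketch of the set-based alternative, assuming the Changed Rows output is written to a staging table by an OLE DB Destination instead of being sent to the OLE DB Command. The table name dbo.stgUpdates is hypothetical; an Execute SQL Task placed after the Data Flow task would then run a single joined UPDATE:

```sql
-- One-time setup (run in a query window): hypothetical staging table
-- that the Changed Rows output would load.
USE SSISIncrementalLoad_Dest
GO
CREATE TABLE dbo.stgUpdates
(ColID int NOT NULL
,ColA varchar(10) NULL
,ColB datetime NULL
,ColC int NULL)
GO

-- Statements for the Execute SQL Task, run after the data flow completes.
-- One joined UPDATE replaces the per-row parameterized statement.
UPDATE d
SET d.ColA = u.ColA,
    d.ColB = u.ColB,
    d.ColC = u.ColC
FROM dbo.tblDest d
INNER JOIN dbo.stgUpdates u ON u.ColID = d.ColID

-- Empty the staging table so the next load starts clean.
TRUNCATE TABLE dbo.stgUpdates
```

The trade-off is an extra table and an extra task in exchange for one UPDATE statement per load rather than one per changed row.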


 Click the OK button when mapping is completed.

Your Data Flow canvas should look like that pictured below:


If you execute the package with debugging (press F5), the package should succeed and appear as shown here:


Note one row takes the "New Rows" output from the Conditional Split, and one row takes the "Changed Rows" output from the Conditional Split transformation. Although not visible, our third source row doesn't change, and would be sent to the "Unchanged Rows" output - which is simply the default Conditional Split output renamed. Any row that doesn't meet any of the predefined conditions in the Conditional Split is sent to the default output.
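One way to confirm the result outside BIDS is to compare the two tables directly. As a sketch against the demo databases, the following query lists any source rows that are missing from, or different in, the destination; after a successful run it should return no rows:

```sql
-- Source rows with no identical row in the destination.
SELECT ColID, ColA, ColB, ColC
FROM SSISIncrementalLoad_Source.dbo.tblSource
EXCEPT
SELECT ColID, ColA, ColB, ColC
FROM SSISIncrementalLoad_Dest.dbo.tblDest
```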


That's all! Congratulations - you've built an incremental database load! [:)]

Get the code! (Free registration required)

:{> Andy

Published Monday, July 09, 2007 3:13 PM by andyleonard
Filed under: Design Pattern, Incremental, SSIS


Comments

Jason Haley said:

July 10, 2007 9:09 AM

Jason Haley said:

July 10, 2007 9:10 AM

Alberto Ferrari said:

Andy, maybe you are interested in taking a look at the TableDifference component I published at http://www.sqlbi.eu.

It is an all-in-one and completely free SSIS component that handles these kinds of situations without the need to cache data in the Lookup. Lookups are nice but - in real situations - they may quickly lead to out-of-memory conditions (think of a hundred-million-row table... it simply cannot be cached in memory).

Beware that - for huge table comparison - you will need both TableDifference AND the FlowSync component that you can find at the same site.

I'll be glad to hear your comments about it.

Alberto

July 12, 2007 5:21 AM


andyleonard said:

Thanks Alberto! Checking it out now.

:{> Andy

July 13, 2007 9:30 PM

David R Buckingham said:

Thank you greatly Andy.  This couldn't have come at a better time as I just started using Integration Services for the first time on Friday to handle eight different data loads (all for a single client).  Four of the data loads are straight appends, but the other four are incremental.

This approach is vastly superior to loading the incremental data into a temporary table and then processing it against the destination table.  In fact, it proved to be more efficient than either set-based insert/updates or a cursor-based approach.  Yes, I tested both approaches prior to implementing yours.  Your approach was faster than the set-based insert/updates even though I tested it across the WAN, which surprised me greatly.

I also created a script to assist with the creation of the Conditional Split "Changed Rows" condition which follows (be sure your results aren't being truncated when you have a table with many columns):

--- BEGIN SCRIPT ---

DECLARE @Filter varchar(max)

SET @Filter = ''

-- ((ISNULL(<ColumnName>)?"":<ColumnName>)!=(ISNULL(Dest_<ColumnName>)?"":Dest_<ColumnName>)) ||

SELECT @Filter = @Filter + '((ISNULL(' + c.[name] + ')?"":' + c.[name] + ')!=(ISNULL(Dest_' + c.[name] + ')?"":Dest_' + c.[name] + ')) || '

FROM sys.tables t

INNER JOIN sys.columns c

ON t.[object_id] = c.[object_id]

WHERE SCHEMA_NAME( t.[schema_id] ) = 'GroupHealth'

AND t.[name] = 'ConsumerDetail'

AND c.[is_identity] = 0

AND c.[is_rowguidcol] = 0


ORDER BY

c.[column_id]

SET @Filter = LEFT( @Filter, LEN( @Filter ) - 2 )

SELECT @Filter

--- END SCRIPT ---

Again, thanks greatly.  I now have 2 SSIS books on their way to me.  I am eager to learn as much as I can.

July 17, 2007 3:52 PM

Bill Mo said:

Hello, Andy! Thanks a lot for your incremental process! I'm doing an SSIS project!

July 17, 2007 9:47 PM

david boston said:

Thanks this worked a treat for my SSIS project.

July 20, 2007 5:01 AM

andyleonard said:

Hi David, Bill, and David,

  Thanks for the feedback!

:{> Andy

August 8, 2007 7:14 PM

saul said:

Hi Andy !!  Great work... I was scared because of this Incremental load... and you saved my weekend... now I can enjoy it .... :-)

September 7, 2007 5:56 PM

Steve Hall said:

Anyone had a problem with the insert and update commands locking each other out?

Didn't happen at first but does now.  Update gets blocked by the insert and it just hangs.


Steve

September 18, 2007 1:18 PM

andyleonard said:

Thanks Saul!

Steve, are you sure there's not something more happening on the server that's causing this?

If this is repeatable, please provide more information and I'll be happy to take a look at it.

SQL Server does a fair job of detecting and managing deadlocks when they occur. I haven't personally seen SQL Server "hang" since 1998 - and then it was due to a failing I/O controller.

:{> Andy

September 27, 2007 6:57 PM

Bill Mo said:

Hi, Andy! I have the same problem as Steve: blocking. When the bulk insert and update happen, the update gets blocked by the insert and it just hangs! The insert's wait type is ASYNC_NETWORK_IO.

October 8, 2007 4:15 AM

Bobby said:

Thx 4 the trick with Fail -> Left Join ! I was thinking how to do it whole day :o)

October 18, 2007 1:23 AM

Andy Leonard said:

Introduction This post is part of a series of posts on ETL Instrumentation. In Part 1 we built a database

November 18, 2007 10:53 PM

Michael Ross said:

Steve,

This most certainly can be the case with larger datasets.  In my case, I ran into this issue with large FACT table loads.  Either consider dumping the contents of the insert into a temp table or SSIS RAW datafile and complete the insert in a separate dataflow task or modify the isolationlevel of the package.  Be warned, make sure you research the IsolationLevel property thoroughly before making such a change.


November 26, 2007 12:03 PM

Michael said:

What happens when a field is NULL in the destination or source when determining changed rows? Don't we need special checks to ensure that if a destination field is NULL the source is too, and that otherwise a change has occurred and the record should be updated?

December 26, 2007 10:26 AM

andyleonard said:

Hi Michael,

  Excellent question! This post was intended to cover the principles of Incremental Loads, and not as a demonstration of production-ready code. </CheesyExcuse>

  There are a couple approaches to handling NULLs in the source or destination, each with advantages and disadvantages. In my opinion, the chief consideration is data integrity and the next-to-chief consideration is metadata integrity.

  A good NULL trap can be tricky because NULL == NULL should never evaluate to True. I know NULL == NULL can evaluate to True with certain settings, but these settings also have side-effects. And then there's maintenance to consider... basically, there's no free lunch.

  A relatively straightforward method involves identifying a value for the field that the field will never contain (i.e. -1, "(empty)", or even the string "NULL") and using that value as a substitute for NULL. In the SSIS expression language you can write a change-detection expression like:

(ISNULL(Dest_ColA) ? -1 : Dest_ColA) != (ISNULL(ColA) ? -1 : ColA)

  But again, if ColA is ever -1 this will evaluate as a change and fire an update. Why does this matter? Some systems include "number of updated rows" as a validation metric.

:{> Andy

December 26, 2007 12:50 PM

Michael said:

Hi Andy,

Thanks for this great article!

Do you have any hints for implementing your design with an Oracle Source? I am attempting to incrementally update from a table with 7 million rows with ~50 fields. The Lookup Task failed when I attempted to use it like you described above due to a Duplicate Key error...cache is full. I googled this and found an article suggesting enabling restrictions and enabling smaller cache amounts. However it is now extremely slow. Do you have any experience/advice on tweaking the lookup task for my environment?

Is there value in attempting to port this solution to an Oracle to SQL environment?

Is there a way to speed things up/replace the lookup task by using a SQL Execution Task which calls a left outer join?

Is there major difference\impact in having multiple primary keys?

Thanks Again

December 26, 2007 1:47 PM

Andy Leonard said:

Now that our 5-month old son - Riley Cooper - is on the mend , I am hitting the speaking trail again!

January 6, 2008 6:16 PM

Jigs said:

Hi Andy, this looks great and works great too, but if there are more records to update then it just hangs while doing the insert and update. What should I do? Is there any workaround by which we can avoid the hanging of the SSIS package? Please suggest.

Thanks

Jigu

January 15, 2008 3:36 PM

andyleonard said:

Hi Bill and Jigu,

Although I mention set-based updates here I did not demonstrate the principle because I felt the post was already too long - my apologies.

I have since written more on Design Patterns. Part 3 of my series on ETL Instrumentation (http://sqlblog.com/blogs/andy_leonard/archive/2007/11/18/ssis-design-pattern-etl-instrumentation-part-3.aspx#SetBasedUpdates) demonstrates set-based updates.

I need to dedicate a post to set-based updates.

:{> Andy

January 16, 2008 7:10 AM


Jai said:

Hi Andy

Thanks, you were a great help in understanding data updates through an SSIS package.

April 5, 2008 6:16 PM

Kenneth said:

Hi Andy,

I have a hard time following your instructions. Can you send me your sample project

Thank You

Kenneth


July 29, 2008 1:44 PM

andyleonard said:

Hi Kenneth,

  Sorry to hear you're having a hard time with my instructions.

  One of the last instructions is a link at the bottom of the page called "Get the code". It points to this URL: http://vsteamsystemcentral.com/dnn/Demos/IncrementalLoads/tabid/94/Default.aspx.

Hope this helps,

Andy

July 29, 2008 1:59 PM

EAD said:

Not sure; I've posted the same question in a few places... Maybe you gurus can explain.

In SSIS, the Fuzzy Grouping object creates some temp tables and does the fuzzy logic. I ran a trace to see how it works; in one cursor it is taking a very long time to process 150000 records. The same package executes fine in other test environments. The cursor is simple and I can post it if needed. Any thoughts?

September 11, 2008 8:45 PM


LNelson said:

I have a similar package I am trying to create and this was a big help.  The new rows write properly; however, I am getting an error on the changed rows because the SQL table I am writing to has an auto-incremented identity column. The changes won't write to the SQL table.  If I uncheck "keep identity" it writes new rows instead of updating existing ones.  What am I missing?

December 1, 2008 11:38 AM

FDA said:

Thanks a lot, Andy!! Very helpful!

December 17, 2008 3:48 AM

Rajesh said:

Hi Andy..

  That's a good alternative to a slowly changing dimension!

  Well done...

  What if the incremental load is based on more than one column?

  And, to complicate things further, what if one of the columns included in the lookup condition changes as well?

  Last one... what if the row is deleted from the source?

January 6, 2009 3:23 AM

Ken said:

It looks like your package handles new and updated rows.

I don't see the code handling rows deleted in the source (assuming that there are some).

Here are my two cents.

In your lookup, you can split out the match and non-match rows.

Non-match means a new record, and you can do an insert directly after the lookup. You can eliminate the 'new row' condition in your 'conditional split'.

However, overall, your sample package is the best sample on the net (as far as I have searched). I love it, honestly.

Keep up the great work and keep giving out sample packages.

Like most people, I do appreciate your effort.

Ken

January 7, 2009 8:10 PM

andyleonard said:

Hi Ken,

  Thanks for your kind words.

  I believe you're referring to functionality new to the SSIS 2008 Lookup Transformation - there is no Non-Match Rows output buffer in the SSIS 2005 Lookup Transformation.

:{> Andy

January 7, 2009 9:58 PM

RVS said:

Hi Andy,

Thanks a lot for this article. It proved to be a great help for me.

I was wondering if you can provide some solution to handle deleted rows from source table using lookup. I need this because I have to keep the historical data in the data warehouse.

Thanks in advance,

RVS


January 21, 2009 3:04 AM

Charlie Asbornsen said:

Andy, thanks for your help and effort.  This is definitely more elegant than staging over to one database and then doing ExecuteSQLs to execute incremental loads.

January 21, 2009 5:16 PM

Charlie Asbornsen said:

And re ranvijay's question, I would assume that when the row exists in the destination but not the source, the source RowID would show up as null, so you could do that as another split on the conditional.

January 21, 2009 5:18 PM


andyleonard said:

Hi RVS and Charlie,

  RVS, Charlie answered your question before I could get to it! I love this community!

  I need to write more on this very topic. New features in SQL Server 2008 change this and make the Deletes as simple as New and Updated rows.

  I didn't mention Deletes in this post because the main focus was to get folks thinking about leveraging the data flow instead of T-SQL-based solutions (Charlie, in regards to your first comment). There's nothing wrong with T-SQL. But a data flow is built to buffer (or "paginate") rows. It bites off small chunks, acts on them, and then takes another bite. This greatly reduces the need to swap to disk - and we all know the impact of disk I/O on SQL Server performance.

  Charlie is correct. The way to do Deletes is to swap the Source and Destinations in the Correlate / Filter stages.

  Typically, I stage Deletes and Updates in a staging table near the table to be Deleted / Updated. Immediately after the data flow, I add an Execute SQL Task to perform a correlated (inner joined) update or delete with the target table. I do this because my simplest option inside a data flow is row-based Updates / Deletes using the OLE DB Command transformation. A set-based Update / Delete is a lot faster.

  I need to write more about that as well...

:{> Andy

January 21, 2009 5:29 PM
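For reference, the delete case discussed in this thread is the new-rows query with the two tables swapped. A sketch in T-SQL against the demo tables follows; the comments describe doing the equivalent with a staging table and an Execute SQL Task, and this shows only the set-based form of the same idea:

```sql
-- Rows present in the destination with no match in the source.
SELECT d.ColID
FROM SSISIncrementalLoad_Dest.dbo.tblDest d
LEFT JOIN SSISIncrementalLoad_Source.dbo.tblSource s
  ON s.ColID = d.ColID
WHERE s.ColID IS NULL

-- Remove those rows from the destination in one statement.
DELETE d
FROM SSISIncrementalLoad_Dest.dbo.tblDest d
LEFT JOIN SSISIncrementalLoad_Source.dbo.tblSource s
  ON s.ColID = d.ColID
WHERE s.ColID IS NULL
```

Running the SELECT first is a cheap way to review what the DELETE would remove.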

Charlie Asbornsen said:

Andy,

Looks like I have some rewriting to do on the next version of the ETL.  It's a good thing I enjoy working in SSIS!

I'm working on building a data warehouse and BI solution for a government customer, and a lot of their 1970's era upstream data sources don't have ANY kind of data validation.  In fact when we first installed in production we found out that they had some code fields in their data tables with a single quote for data!  It played merry hob with our insert statements until we figured out what was happening. Then I got to figure out how to do D-SQL whitelisting with VB scripting in SSIS :)

Of course since its the government we'll probaby have to wait until 3Q 2010 before we're allowed to upgrade to SQL 2008.  We were all gung ho about VS 2008 (which we were allowed to get) but imagine my chagrin when I found out that I couldn't use my beloved BI Studio without SQL 2008... :P  So I'll be using this for the next version... and possibly the version after that as well.

Thanks a bunch!

January 21, 2009 5:41 PM

Charlie Asbornsen said:

Me again.

I think I made a mistake.  If a row already exists in the destination table and it no longer exists in the source table, I want it deleted (sent to the deletes staging table).  However, the lookup limits the row set in memory to items that are already in the source table, so it's not really functioning as an outer join.  It's perfect for determining inserts and updates, but I need to do something else to do deletes...

I'm going to try adding an additional OLE DB source and point that at the same table the lookup is checking... hmm, maybe try the Merge?  I'll see what happens and let you know.

January 22, 2009 12:41 PM

Charlie Asbornsen said:

Actually I think I need a second pass... grrr.

January 22, 2009 12:44 PM

Charles Asbornsen said:

Andy,

Please feel free to combine this with the previous reply.

What I wound up doing was creating a second data flow after the one that split the inserts and updates out.  The deletes flow populated a deleted rows staging table with the deleted row id, which then was joined to the ultimate destination table in a delete command in an Execute SQL task.  I wound up reversing the lookup, but used the same technique by using a conditional split on whether or not the new column from the lookup was null, and if it was, the output went to the "deleted records" path, which populated the staging table.
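For comparison, the reversed lookup described above has a simple T-SQL counterpart. This is only a sketch: the demo tables from the post stand in for the real source and destination, both databases are assumed to live on the same server, and dbo.stgDeletes is a hypothetical staging table with a single ColID column.

```sql
-- Sketch only: dbo.stgDeletes is a hypothetical staging table (ColID int).
-- Start from the destination and look each key up in the source;
-- a missing match means the row has disappeared upstream.
INSERT INTO SSISIncrementalLoad_Dest.dbo.stgDeletes (ColID)
SELECT dest.ColID
FROM SSISIncrementalLoad_Dest.dbo.tblDest dest
LEFT JOIN SSISIncrementalLoad_Source.dbo.tblSource src
  ON src.ColID = dest.ColID
WHERE src.ColID IS NULL   -- same test as the conditional split on the null lookup column
```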

The reason I want to actually remove the data from the table, as opposed to merely marking it as deleted, is that a row would only disappear because it was a bad reference code in the first place.  My big data warehouse ETL adds new reference codes to the reference tables (which it needs to create in the first place because the source reference codes are held in these five gigantic tables which do not lend themselves to generating NV lists) for unmatched codes in the data tables (remember there's no validation at the source).

When the reconciliation stick finally gets swung and the customer replaces the junk code, it disappears from my ETL and I remove it from my table.  It is different from a code that gets obsoleted; there's a reason to track those, but garbage just needs to be thrown out.

Thanks again, I would have been very annoyed with myself if I wound up doing row-based IUDs...

January 22, 2009 2:55 PM

andyleonard said:

Hi Charles,

  I wasn't clear in my earlier response but you figured it out anyway - apologies and kudos. You do need to do the Delete in another Data Flow Task.

  Excellent work!

:{> Andy

January 22, 2009 4:15 PM

Charles Asbornsen said:

Andy,

Is there a limit to how many comparisons you can make in the Conditional Split Transformation Editor?  I have a table with 20 columns, and I'm trying to do 19 comparisons.  It's telling me that one of the columns doesn't exist in the input column collection.  I can cut the expression and paste it back in and it picks a different column to complain about.  Error 0xC0010009... it says the expression cannot be parsed, may contain invalid elements or might not be well formed, and there may also be an out-of-memory error.

I've been looking at it for half an hour and all the columns it is variously complaining about are present in the input column collection, so I suspect it's a memory error.  Should I alias the column names to be shorter (i.e. the problem is in the text box) or is it a metadata problem?  I'm going home now but tomorrow I will see if splitting the staging table into 4 tables and splitting the conditions into 4 outputs (to be recombined later by an execute SQL command into the real staging table) does what I need.

Thanks!

Charlie

January 22, 2009 5:54 PM

RVS said:

Hi Andy and Charles,

Thank you for your comments. I still have a few doubts about handling deleted rows. I have created a solution that handles all three cases (add, update and delete). I took two OLE DB Sources (one with the source table's data and another with the destination table's data), SORTED them, MERGED them (with a FULL OUTER join), and finally used a CONDITIONAL SPLIT to separate new, updated and deleted rows, with an OLE DB Command performing the required action. The full outer join is what gives me the deleted rows.
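To make the full outer join idea concrete, here is a rough T-SQL sketch of the same classification. It assumes the demo tables from the post stand in for the two sources and that both databases are on one server; the RowAction labels are illustrative only.

```sql
-- Sketch only: classify every key by which side of the join it appears on.
SELECT COALESCE(src.ColID, dest.ColID) AS ColID
      ,CASE WHEN dest.ColID IS NULL THEN 'New'      -- only in the source
            WHEN src.ColID IS NULL THEN 'Deleted'   -- only in the destination
            ELSE 'Existing'                         -- in both: compare columns to find updates
       END AS RowAction
FROM SSISIncrementalLoad_Source.dbo.tblSource src
FULL OUTER JOIN SSISIncrementalLoad_Dest.dbo.tblDest dest
  ON dest.ColID = src.ColID
```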

I am getting expected result with this solution but I think this is not performance efficient as it is using sort, merge etc. I wanted to use Lookup as suggested by Andy. But the solution which you both have given is not fully clear to me. Will it be possible for you to send me a sketch of the proposed solution or explain it a bit in detail?

Charles, regarding no. of comparisons, I don't think it is limited to 19 or 20 because I have used more than 35 comparisons and that is working fine. Please check if you have checked for null columns correctly.

Thanks once again,

RVS

January 23, 2009 6:57 AM

Charlie Asbornsen said:

Doh! Thanks Ranvijay.

January 23, 2009 10:01 AM

Charlie Asbornsen said:

Actually what was happening was that since the comparison expression was so long I moved it into WordPad to type it and then copy/pasted into the rather annoyingly non-resizable condition field in the conditional split transformation editor.  It turns out it doesn't like that.  Maybe there were invisible control characters in the string, so I needed to just bite the bullet and type in the textbox.  It works fine now.

It would be nice to have a text visualizer for that field.

Thanks!

January 23, 2009 1:51 PM

vidhya said:

This was an excellent article, and Andy's illustration style is great.

Thank you

June 30, 2009 9:47 AM

Nostromo said:

Great tutorial!  I'm new to SSIS and I worked through it without a hitch.

Thanks!!!

July 10, 2009 10:23 AM

DVL said:

Hello,

Many thanks for the step by step guide.

It's nice to find a way to get your changed and new records in 2 separate outputs. But how would you get the deleted records? The only solution I found is to look up every PK in the source db table and check if it still exists. If it does not, the deleted_flag is set to 1. Do you have any idea how to implement the deleted records in your solution? Mine is in a separate dataflow.

Greetings  

August 27, 2009 8:05 AM

CSu said:

Great article! I originally used sort, merge join (with left outer join) and conditional split transforms to perform incremental load. Unfortunately it did not work as expected. Your article has simplified my design and it is now working perfectly. Thanks for sharing. :)

October 26, 2009 7:26 AM

hasan said:

Dear Andy

Your solution is great but I have a problem: the dimensions are not getting populated with the default data. Does this work with an Excel source? I ask because I have an Excel source.

December 29, 2009 7:31 AM

Mike said:

Hiya,

Just read the article, confirms my approach to incremental loading on a series of smallish facts.

I have used the "slowly changing dimension" element in the past to facilitate the same outcome, i.e. not using type 2s (despite being a fact) - but it is much slower.

RVS, re: "I am getting expected result with this solution but I think this is not performance efficient as it is using sort, merge etc": if the sort(s) are the main problem, you can do the sort on the database and tell SSIS that the set is sorted, which avoids using two Sort transformations - not sure if that will give you sufficient gains? The Merge Join, as you say, will still not be great within SSIS.
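A sketch of what sorting on the database might look like, using the demo source table from the post as a stand-in. The query goes in the OLE DB Source; the comments name the two properties that mark the output as sorted. SSIS trusts these flags and does not check the order itself, so the ORDER BY has to match them.

```sql
-- Sketch only: let SQL Server do the sort in the OLE DB Source query.
SELECT ColID, ColA, ColB, ColC
FROM dbo.tblSource
ORDER BY ColID   -- must be the column(s) the Merge Join uses as its key

-- Then, in the source's Advanced Editor (Input and Output Properties tab):
--   OLE DB Source Output      -> IsSorted = True
--   Output Columns -> ColID   -> SortKeyPosition = 1
```

The destination-side source needs the same treatment before the two outputs can feed a Merge Join without Sort transformations.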

Lastly - has anyone any experience of duplicated KEYS in the source table, that do not (yet) exist in the destination?

I am performing bulk-inserts after the update/insert evaluation. I have a minor concern: if a key appears twice in the source data, the FIRST record will correctly INSERT, but does the lookup then add this key to memory, so that when the second record with that key arrives it knows to update?

Because, although I do not constrain the destination table, it will cause problems within the data (mini Cartesians - *shudder*).

Do I need to be aware of any settings or the like? I am about to do a test-case now - and see what happens...
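One way to find out whether the concern applies before running the load is to check the source for repeated keys. This is a sketch under assumptions: dbo.stgSource is a hypothetical unconstrained source table (the demo tblSource has a primary key on ColID, so it cannot hold duplicates).

```sql
-- Sketch only: dbo.stgSource is a hypothetical source table with no key constraint.
-- Lists every key that occurs more than once, with its row count.
SELECT ColID
      ,COUNT(*) AS RowsPerKey
FROM dbo.stgSource
GROUP BY ColID
HAVING COUNT(*) > 1
ORDER BY ColID
```

An empty result means every key is unique and each row can only be an insert or an update, never both within one load.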

January 24, 2010 5:35 PM

Mandar said:

Hi Andy,

I want to load data incrementally from source (MySQL 5.2) to SQL Server 2008, using SSIS 2008, based on modified date. Somehow I am not able to do it as MySQL doesn't support parameters. Need some help on this.

-regards, mandar

March 15, 2010 6:40 AM

Ramdas said:

Thank you Andy for this tutorial. I am using SSIS 2008, and the Lookup task interface has changed a little bit: when you click on edit on the lookup task, the opening screen is laid out differently.

March 25, 2010 9:46 AM

KK said:

In my source, ID 3 has duplicated keys, so I want the first record to be inserted and the second record to update the destination table through SSIS.

Can anyone help me resolve this problem?

When I use SCD type 2 and it reads the records, the ID 3 record is not yet available in the target, so the first record is treated as an insert, and the same happens for the second record.

So that record is inserted twice. I don't want that; I want the first record inserted and the second record of ID 3 to update it.

Is there any way to resolve this problem?

ID  Name      Date
1   Kiran     1/1/2010 12:00:00 AM
3   Rama      1/2/2010 12:00:00 AM
2   Dubai     1/2/2010 12:00:00 AM
3   Ramkumar  1/2/2010 12:00:00 AM
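Inserting the first row and then updating it with the second leaves the same result as loading only the last row for each ID, so one possible approach is to reduce the source to one row per key before the load. The sketch below assumes a hypothetical RowSeq column that records arrival order (the sample rows for ID 3 share the same date, so the date alone cannot decide which is last), and it reads the dates as month/day/year.

```sql
-- Sketch only: RowSeq is a hypothetical arrival-order column.
DECLARE @Source TABLE
(RowSeq int IDENTITY(1,1) NOT NULL
,ID int NOT NULL
,Name varchar(20) NULL
,[Date] datetime NULL)

-- Sample rows from the comment above (dates assumed to be month/day/year).
INSERT INTO @Source (ID, Name, [Date]) VALUES (1, 'Kiran', '20100101')
INSERT INTO @Source (ID, Name, [Date]) VALUES (3, 'Rama', '20100102')
INSERT INTO @Source (ID, Name, [Date]) VALUES (2, 'Dubai', '20100102')
INSERT INTO @Source (ID, Name, [Date]) VALUES (3, 'Ramkumar', '20100102')

-- Number the rows within each ID, newest first, and keep only the newest.
;WITH Ranked AS
(
  SELECT ID, Name, [Date]
        ,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY RowSeq DESC) AS RowRank
  FROM @Source
)
SELECT ID, Name, [Date]
FROM Ranked
WHERE RowRank = 1
ORDER BY ID
```

With the sample rows this keeps Kiran for ID 1, Dubai for ID 2 and Ramkumar for ID 3, so the incremental load sees each key once.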

March 25, 2010 5:11 PM

Craig said:

I need to incrementally load data from Sybase to SQL.  There will be several hundred million rows.  Will this approach work OK with this scenario?

March 30, 2010 10:45 AM

andyleonard said:

Hi Craig,

  Maybe, but most likely not. This is one design pattern you can start with. I would test this, tweak it, and optimize like crazy to get as much performance out of your server as possible.

:{> Andy

March 30, 2010 10:52 AM

jpedroalmeida said:

Hi there from Portugal,

Andy, I am a beginner in SSIS and I found this article very useful and straightforward in its explanation with text and images...

Thanks a lot!!

Cheers

April 25, 2010 11:02 AM

JohnnyReaction said:

Hi Andy

I amended your script to deal with different datatypes (saves a lot of debugging in the Conditional Split Transformation Editor):

/*
This script assists with the creation of the Conditional Split "Changed Rows" condition
-- be sure your results aren't being truncated when you have a table with many columns
*/
--- BEGIN SCRIPT ---
USE master  -- change to the database that holds the table named below
GO
DECLARE @Filter varchar(max)
SET @Filter = ''

-- Build one "(ISNULL(col)?default:col)!=(ISNULL(Dest_col)?default:Dest_col)" test per column.
-- Default by system_type_id:
--   35 text, 104 bit, 167 varchar, 175 char, 231 nvarchar, 239 nchar, 241 xml -> ""
--   58 smalldatetime, 61 datetime -> (DT_DBTIMESTAMP)"1900-01-01"
--   anything else -> 0
SELECT @Filter = @Filter + '((ISNULL(' + c.[name] + ')?' +
    CASE WHEN c.system_type_id IN (35,104,167,175,231,239,241) THEN '""'
         WHEN c.system_type_id IN (58,61) THEN '(DT_DBTIMESTAMP)"1900-01-01"'
         ELSE '0' END
    + ':' + c.[name] + ')!=(ISNULL(Dest_' + c.[name] + ')?' +
    CASE WHEN c.system_type_id IN (35,104,167,175,231,239,241) THEN '""'
         WHEN c.system_type_id IN (58,61) THEN '(DT_DBTIMESTAMP)"1900-01-01"'
         ELSE '0' END
    + ':Dest_' + c.[name] + ')) || '
FROM sys.tables t
INNER JOIN sys.columns c
    ON t.[object_id] = c.[object_id]
WHERE SCHEMA_NAME( t.[schema_id] ) = 'dbo'
    AND t.[name] = 'DimUPRTable'
    AND c.[is_identity] = 0
    AND c.[is_rowguidcol] = 0
ORDER BY
    c.[column_id]

-- LEN ignores the trailing space, so this trims the final "||"
SET @Filter = LEFT(@Filter, (LEN(@Filter) - 2))
SELECT @Filter

--SELECT
-- c.*
--FROM
-- sys.tables t
--JOIN
-- sys.columns c
-- ON t.[object_id] = c.[object_id]
--WHERE
-- SCHEMA_NAME( t.[schema_id] ) = 'dbo'
--AND t.[name] = 'DimUPRTable'
--AND c.[is_identity] = 0
--AND c.[is_rowguidcol] = 0
--ORDER BY
--c.[column_id]

--SELECT
-- schemas.name AS [Schema]
-- ,tables.name AS [Table]
-- ,columns.name AS [Column]
-- ,CASE WHEN columns.system_type_id = 34 THEN 'byte[]'
-- WHEN columns.system_type_id = 35 THEN 'string'
-- WHEN columns.system_type_id = 36 THEN 'System.Guid'
-- WHEN columns.system_type_id = 48 THEN 'byte'
-- WHEN columns.system_type_id = 52 THEN 'short'
-- WHEN columns.system_type_id = 56 THEN 'int'
-- WHEN columns.system_type_id = 58 THEN 'System.DateTime'
-- WHEN columns.system_type_id = 59 THEN 'float'
-- WHEN columns.system_type_id = 60 THEN 'decimal'
-- WHEN columns.system_type_id = 61 THEN 'System.DateTime'
-- WHEN columns.system_type_id = 62 THEN 'double'
-- WHEN columns.system_type_id = 98 THEN 'object'
-- WHEN columns.system_type_id = 99 THEN 'string'
-- WHEN columns.system_type_id = 104 THEN 'bool'
-- WHEN columns.system_type_id = 106 THEN 'decimal'
-- WHEN columns.system_type_id = 108 THEN 'decimal'
-- WHEN columns.system_type_id = 122 THEN 'decimal'
-- WHEN columns.system_type_id = 127 THEN 'long'
-- WHEN columns.system_type_id = 165 THEN 'byte[]'
-- WHEN columns.system_type_id = 167 THEN 'string'
-- WHEN columns.system_type_id = 173 THEN 'byte[]'
-- WHEN columns.system_type_id = 175 THEN 'string'
-- WHEN columns.system_type_id = 189 THEN 'long'
-- WHEN columns.system_type_id = 231 THEN 'string'
-- WHEN columns.system_type_id = 239 THEN 'string'
-- WHEN columns.system_type_id = 241 THEN 'string'
-- END AS [Type]
-- ,columns.is_nullable AS [Nullable]
--FROM
-- sys.tables tables
--INNER JOIN
-- sys.schemas schemas
--ON (tables.schema_id = schemas.schema_id )
--INNER JOIN
-- sys.columns columns
--ON (columns.object_id = tables.object_id)
--WHERE
-- tables.name <> 'sysdiagrams'
-- AND tables.name <> 'dtproperties'
--ORDER BY
-- [Schema]
-- ,[Table]
-- ,[Column]
-- ,[Type]

July 28, 2010 8:26 AM

Paul Klotka said:

Using T-SQL to do change detection.

I would not use a join to detect change because in the where clause you need to handle NULL values. For example if ColA in Source is NULL it doesn't matter what ColA is in the destination, the where clause will return false and not detect the change.
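A small illustration of the NULL problem described above, written as a sketch with hypothetical variables standing in for one source value and one destination value. It assumes the default ANSI_NULLS ON setting.

```sql
-- Sketch only: the variables stand in for Source.ColA and Dest.ColA on one row.
DECLARE @SourceColA varchar(10)
DECLARE @DestColA varchar(10)
SET @SourceColA = NULL
SET @DestColA = 'A'

SELECT
  -- NULL <> 'A' evaluates to UNKNOWN, so the plain test falls through to ELSE.
  CASE WHEN @SourceColA <> @DestColA
       THEN 'change detected'
       ELSE 'change missed'
  END AS NaiveComparison
  -- Spelling out the NULL cases catches the difference.
 ,CASE WHEN @SourceColA <> @DestColA
         OR (@SourceColA IS NULL AND @DestColA IS NOT NULL)
         OR (@SourceColA IS NOT NULL AND @DestColA IS NULL)
       THEN 'change detected'
       ELSE 'change missed'
  END AS NullAwareComparison
```

The extra NULL checks are needed for every nullable column, which is the bookkeeping the union approach avoids.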

To get around this I use a union to detect change. Here is an example.

select ColId, ColA, ColB, ColC from Source
union
select ColId, ColA, ColB, ColC from Dest

This returns a distinct set of rows, including handling NULL values. All that is left is to determine if the ColId appears more than once in the set.

select ColId from (
  select ColId, ColA, ColB, ColC from Source
  union
  select ColId, ColA, ColB, ColC from Dest
) x
group by ColId
having count(*) > 1

Now I have a list of keys which changed. I can take this list and sort it to use in a merge join in SSIS or I can use it as a subquery to join back to the Source table. See below.

select s.ColId, s.ColA, s.ColB, s.ColC from Source s
inner join (
  select ColId from (
    select ColId, ColA, ColB, ColC from Source
    union
    select ColId, ColA, ColB, ColC from Dest
  ) x
  group by ColId
  having count(*) > 1
) y
on s.ColId = y.ColId
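A related alternative, not the union technique above, is EXCEPT, which is available from SQL Server 2005 and also treats two NULLs as equal when comparing rows. The sketch below reuses the Source and Dest names from the queries above; the join back to Dest is there to keep only changed rows and leave out brand-new ones.

```sql
-- Sketch only: EXCEPT returns Source rows that have no identical row in Dest,
-- which covers both new rows and changed rows.
SELECT changed.ColId, changed.ColA, changed.ColB, changed.ColC
FROM (
  SELECT ColId, ColA, ColB, ColC FROM Source
  EXCEPT
  SELECT ColId, ColA, ColB, ColC FROM Dest
) changed
-- Keep only keys that already exist in Dest, i.e. the changed rows.
INNER JOIN Dest d
  ON d.ColId = changed.ColId
```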

July 28, 2010 2:06 PM

Chhavi said:

Thanks for the good explanation and screenshots. I found this website to be extremely helpful and supportive.

Please let me know if I can learn something more from you and the rest of the guys visiting this website, so that we can become better at SSIS and SQL Server 2005 or 2008.

Please provide us with similar articles so that we can go through them and practice.

Thanks again Andy.

Long Live Andy :)

August 18, 2010 3:59 PM

About andyleonard

Andy Leonard is an Architect with Molina Medicaid Solutions, SQL Server database and Integration Services developer, SQL Server MVP, PASS Regional Mentor (Southeast US), and engineer. He is a co-author of Professional SQL Server 2005 Integration Services and SQL Server MVP Deep Dives.

©2006-2010 SQLblog.com™

Brought to you by Adam Machanic & Peter DeBetta
