
Implementing Data Warehouse Type 2 Using SSIS


VERSION 1.1

IS-Trans & TH1.1-Group1

APM – Country IS Deepak Kumar Sharma


Version   Date        Changed By            Change Description
1.0       14-Feb-09   Deepak Kumar Sharma   Document Created
1.1       24-Jun-09   Deepak Kumar Sharma   Incorporated review comments


Table of Contents

1 Introduction
2 Purpose of Data warehouse
3 Glimpse of historical data
4 Hardware/Software prerequisites
5 Loading Data into the Staging tables
6 Loading Data from staging tables to Dim tables using slowly changing dimensions with SSIS 2005
7 Appendix

1 Introduction

The purpose of a data warehouse is to provide information that helps managers and others make better business decisions.

As data enters the warehouse from various sources, it is assessed, cleansed (misspellings, missing values, erroneous dates, etc. are fixed) and, where necessary, transformed. This process is known as ETL, an acronym for Extraction, Transformation, and Loading.

Data is fed into a data warehouse in periodic batches, usually during off-peak hours. For this reason, the data is essentially non-volatile between loads, and indexes can be updated each time the warehouse changes.

Generally, there are three kinds of decision support applications: SQL-based, OLAP and data mining.

Reporting applications and applications that support ad hoc SQL queries allow decision makers to find answers to questions that pertain to specific transactions and similar data.

OLAP (online analytical processing) allows decision makers to examine aggregated and summarized information from different points of view, or dimensions, and then to slice and dice and drill down into specifics where necessary. OLAP servers can be relational databases (ROLAP databases with star or snowflake schemas), multidimensional arrays (MOLAP hypercubes) or hybrid stores (HOLAP). OLAP data is organized according to the characteristics and qualities of the data instead of business rules and concepts. The data is denormalized, aggregated and structured logically as a cube of three or more dimensions.

Data mining is the process of exploring large sets of data using statistical or artificial intelligence techniques to classify data, identify patterns or predict trends. Examples of data mining algorithms are decision trees, k-nearest-neighbor analysis, genetic clustering analysis, linear and logistic regression, neural networks, and association analysis.

2 Purpose of Data warehouse

A data warehouse is needed to address the following:

• Data Integration: When data has to be integrated across a heterogeneous data environment, a data warehouse (DW) is needed. A data warehouse serves not only as a repository for historical data but also as an excellent data integration platform. The data in the data warehouse is integrated, subject-oriented, time-variant and non-volatile, which gives you a 360° view of your organization.

• Knowledge Discovery and Decision Support

Knowledge discovery and data mining (KDD) is the automatic extraction of non-obvious, hidden knowledge from large volumes of data. The data warehouse also enables an Executive Information System (EIS), because executives typically cannot be expected to sift through several different reports to get a holistic picture of the organization's performance and make decisions.

• Advanced Reporting & Analysis

The data warehouse is designed specifically to support querying, reporting and analysis tasks. The data model is flattened (denormalized) and structured by subject area, so that users can get even complex summarized information with a relatively simple query and perform multi-dimensional analysis. Most reporting, data analysis and visualization tools take advantage of the underlying data model to provide powerful capabilities such as drill-down, roll-up, drill-across and various ways of slicing and dicing data. The flattened data model makes it much easier for users to understand the data and write queries than working with potentially several hundred tables and writing long queries with complex table joins and clauses.

• Performance

Finally, the performance of transactional systems and query response time make the case for a data warehouse. Transactional systems are meant to do just that, perform transactions efficiently, and hence are designed to optimize frequent database reads and writes. The data warehouse, on the other hand, is designed to optimize frequent complex querying and analysis. Some ad hoc queries and interactive analyses that take a few seconds to minutes on a data warehouse could take a heavy toll on the transactional systems and drag their performance down.

3 Glimpse of historical data

In a Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Both the original and the new record are therefore present, and the new record gets its own primary key.

Example

Customer Key   Name      Address   Current Flag
1              ABC Ltd   Add1      Y

After ABC Ltd moves from Add1 to Add2, the existing row is flagged N and the new information is added as a new row with the current flag set to Y:

Customer Key   Name      Address   Current Flag
1              ABC Ltd   Add1      N
2              ABC Ltd   Add2      Y

A Type 2 slowly changing dimension should be used when the data warehouse needs to track historical changes.
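The ABC Ltd example can be reproduced in a few lines. The following is only a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for SQL Server; the table name dim_customer, its column names and the helper apply_type2_change are illustrative assumptions, and the customer name is used as the business key purely to mirror the example table.

```python
import sqlite3


def apply_type2_change(conn, name, new_address):
    """Expire the current row for `name` and insert a new current row."""
    cur = conn.cursor()
    row = cur.execute(
        "SELECT address FROM dim_customer WHERE name = ? AND current_flag = 'Y'",
        (name,),
    ).fetchone()
    if row is not None and row[0] == new_address:
        return  # nothing changed, keep the current row as it is
    # Flag the existing current row (if any) as historical.
    cur.execute(
        "UPDATE dim_customer SET current_flag = 'N' "
        "WHERE name = ? AND current_flag = 'Y'",
        (name,),
    )
    # The new version receives its own key from AUTOINCREMENT.
    cur.execute(
        "INSERT INTO dim_customer (name, address, current_flag) VALUES (?, ?, 'Y')",
        (name, new_address),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer ("
    " customer_key INTEGER PRIMARY KEY AUTOINCREMENT,"
    " name TEXT NOT NULL,"
    " address TEXT NOT NULL,"
    " current_flag TEXT NOT NULL)"
)
apply_type2_change(conn, "ABC Ltd", "Add1")  # initial load
apply_type2_change(conn, "ABC Ltd", "Add2")  # the customer moves
rows = conn.execute(
    "SELECT customer_key, name, address, current_flag "
    "FROM dim_customer ORDER BY customer_key"
).fetchall()
print(rows)  # [(1, 'ABC Ltd', 'Add1', 'N'), (2, 'ABC Ltd', 'Add2', 'Y')]
```

Both versions of the customer stay in the table, and a query that filters on current_flag = 'Y' returns only the latest address.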

4 Hardware/Software prerequisites

Hardware Platform      HP/IBM etc.
Development Platform   Microsoft Visual Studio 2005
Operating System       Windows Server 2003 SP2 / Windows XP / NT
Database               MS SQL Server 2005

5 Loading Data into the Staging tables

In a data warehouse load, the data is first taken from the source (Oracle, AS400 or SQL Server) into staging tables so that it can be cleaned up. The bulk data is loaded into these tables. This load can be much faster than loading directly into the target table because the staging table has no indexes or constraints on it. More importantly, while the new data is being loaded, the existing data remains fully available for all transactions without any impact, because the load takes place on a separate staging table.

These staging tables hold temporary data only, i.e. the data within the temporary table is automatically deleted when a commit is issued.
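The staging idea described above can be sketched as follows. This is only an illustration, assuming Python's sqlite3 module in place of SQL Server; the table stg_customer, its columns and the helper load_staging are hypothetical names. The sketch clears the staging table explicitly before each batch, and because SQLite has no TRUNCATE TABLE statement it uses an unqualified DELETE for that purpose.

```python
import sqlite3


def load_staging(conn, source_rows):
    """Empty the staging table, bulk-insert one batch, return the row count."""
    cur = conn.cursor()
    # Stand-in for TRUNCATE TABLE, which SQLite does not provide.
    cur.execute("DELETE FROM stg_customer")
    cur.executemany(
        "INSERT INTO stg_customer (customer_id, name, address) VALUES (?, ?, ?)",
        source_rows,
    )
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0]


conn = sqlite3.connect(":memory:")
# No primary key, index or constraint: the table only has to accept rows quickly.
conn.execute("CREATE TABLE stg_customer (customer_id TEXT, name TEXT, address TEXT)")

first_batch = [("C1", "ABC Ltd", "Add1"), ("C2", "XYZ Inc", "Add9")]
second_batch = [("C1", "ABC Ltd", "Add2")]
count_after_first = load_staging(conn, first_batch)
count_after_second = load_staging(conn, second_batch)
print(count_after_first, count_after_second)  # 2 1
```

The staging table never accumulates history: after the second load it contains only the second batch.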

Step 1> Create a new project via File->New->Project (select Business Intelligence Projects->Integration Services Project), as shown in Fig 1.

Fig 1 Creating a project

Step 2> Create a new data source, as shown in Fig 2.

Fig 2 Creating a data source

Step 3> Create a new package.

Step 4> Create a new Execute SQL task that updates the status of the job.

Step 5> Create a new Data Flow task that inserts data from AS400 into the staging (STG) area, as shown in Fig 3.

Fig 3 Creating a data flow task


Step 6> Add a Derived Column transformation for error handling. Any rows that contain erroneous data are redirected to TBL_ERROR_LOG.

Fig 4 Logging the error
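The row redirection in Step 6 can be imitated outside SSIS. The following is a rough sketch in Python with sqlite3, not the SSIS error output itself; the column layout of TBL_ERROR_LOG, the staging table stg_customer and the two validation rules (a key must be present, the date must be a real YYYY-MM-DD date) are assumptions made for the example.

```python
import sqlite3
from datetime import datetime


def is_valid_date(text):
    """Return True when text is a real date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def load_with_error_redirect(conn, source_rows):
    """Insert clean rows into staging and redirect bad rows to the error log."""
    good, bad = [], []
    for customer_id, name, created in source_rows:
        if not customer_id:
            bad.append((customer_id, "missing customer_id"))
        elif not is_valid_date(created):
            bad.append((customer_id, "invalid created date"))
        else:
            good.append((customer_id, name, created))
    conn.executemany("INSERT INTO stg_customer VALUES (?, ?, ?)", good)
    conn.executemany(
        "INSERT INTO tbl_error_log (source_key, error_text) VALUES (?, ?)", bad
    )
    conn.commit()
    return len(good), len(bad)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (customer_id TEXT, name TEXT, created TEXT)")
conn.execute("CREATE TABLE tbl_error_log (source_key TEXT, error_text TEXT)")

rows = [
    ("C1", "ABC Ltd", "2009-02-14"),
    ("C2", "XYZ Inc", "2009-02-31"),    # impossible date
    (None, "No Key Co", "2009-06-24"),  # missing key
]
loaded, rejected = load_with_error_redirect(conn, rows)
print(loaded, rejected)  # 1 2
```

Bad rows do not stop the load; they are kept with a reason so that the source data can be corrected later.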


Step 7> Create the data mapping, as shown in Fig 5.

Fig 5 Data mapping

Step 9> Add another Execute SQL task that marks the job as successful and records the job details, such as JOB_RUN_DATE, JOB_FINISH_TIME and NO_ROWS_INSERTED.

Fig 6 Complete flow for loading data from AS400 to the staging table
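The job bookkeeping done by the two SQL tasks (Step 4 and Step 9) might look roughly like the sketch below. It assumes Python's sqlite3 module rather than SQL Server. The columns JOB_RUN_DATE, JOB_FINISH_TIME and NO_ROWS_INSERTED come from the steps above; the table name tbl_job_log, the columns job_id, job_name and job_status, and the job name LOAD_STG_CUSTOMER are hypothetical.

```python
import sqlite3
from datetime import datetime


def start_job(conn, job_name):
    """Record that a job has started and return its log id."""
    now = datetime.now().isoformat(timespec="seconds")
    cur = conn.execute(
        "INSERT INTO tbl_job_log (job_name, job_status, job_run_date) "
        "VALUES (?, 'RUNNING', ?)",
        (job_name, now),
    )
    conn.commit()
    return cur.lastrowid


def finish_job(conn, job_id, rows_inserted):
    """Mark the job as successful and store its finish time and row count."""
    now = datetime.now().isoformat(timespec="seconds")
    conn.execute(
        "UPDATE tbl_job_log SET job_status = 'SUCCESS', job_finish_time = ?, "
        "no_rows_inserted = ? WHERE job_id = ?",
        (now, rows_inserted, job_id),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_job_log ("
    " job_id INTEGER PRIMARY KEY,"
    " job_name TEXT, job_status TEXT, job_run_date TEXT,"
    " job_finish_time TEXT, no_rows_inserted INTEGER)"
)
job_id = start_job(conn, "LOAD_STG_CUSTOMER")
rows_inserted = 2  # stand-in for the count reported by the data flow
finish_job(conn, job_id, rows_inserted)
status_row = conn.execute(
    "SELECT job_status, no_rows_inserted FROM tbl_job_log WHERE job_id = ?",
    (job_id,),
).fetchone()
print(status_row)  # ('SUCCESS', 2)
```

A downstream package can then check this log before it starts, which is what Step 2 of the next section relies on.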

6 Loading Data from staging tables to Dim tables using slowly changing dimensions with SSIS 2005

Step 1> Create a new package.

Step 2> Check that the corresponding staging data was loaded successfully from the source (AS400 or Oracle).

Step 3> Add a Slowly Changing Dimension (SCD) transformation and run its wizard.

a) Before you run the SCD transformation wizard, you should configure your data source and destination. For this example, I used an OLE DB source as the data source and a data warehouse (which also happens to be a SQL Server database) as the destination. You could also use the SCD transformation for data sources and destinations other than SQL Server.


b) SSIS projects are normally developed in Business Intelligence Development Studio (BIDS). First, create an SSIS project (File→New→Project, and choose the Integration Services Project template), then add a Data Flow task to the control flow. Activate the Data Flow tab, then identify your data source and destination within BIDS. Next, drag the SCD transformation onto the Data Flow designer and drag the green arrow, which denotes output from your data source, to the SCD transformation. Then double-click the SCD transformation to start the SCD wizard.

c) The initial screen simply welcomes you and describes the capabilities of the Slowly Changing Dimension wizard in SQL Server Integration Services 2005. You can choose to omit this screen in the future. The next screen lets you pick one or more business keys. A business key uniquely identifies each record in the table that is used to populate your dimension table. Business keys are presumed to be static: once a key is assigned to a record, it should not change. Each dimension table can have one or more business keys.

Typically your dimension will also have a surrogate key: a column that is not found in the data source but is added to the dimension table during the extraction, transformation and loading (ETL) process. Surrogate keys are usually implemented as identity columns; these columns have no business meaning, but they uniquely identify each dimension record.

d) Once you have identified the business key, it's time to choose the columns (referred to as attributes) you wish to maintain within your SCD. The wizard gives you three options:

o Fixed attribute: values in this column should not change. If SSIS detects a change in the value of a column tagged as a fixed attribute, it can raise an exception and fail the execution of the package. You'll get an option to configure your package this way on the next screen. Detecting changes in the values of a fixed attribute isn't really part of a slowly changing dimension implementation, but this functionality is useful when you want to identify problems in your source data while your SSIS package is executing.

o Changing attribute: this is a type 1 SCD; the values of a changing attribute are simply overwritten. The history of changes is not recorded.

o Historical attribute: this is a type 2 SCD. The modified value is saved in a new record and the existing record is tagged as expired.

For example, in Fig 7 (Slowly Changing Dimension Columns), the screenshot tags the AccountDescription column as a changing attribute, AccountType as a historical attribute and Operator as a fixed attribute.



Fig 7. Change types: fixed attribute, changing attribute and historical attribute.
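The three change types can be summarized as a small decision routine. This is a simplified sketch in plain Python, not the wizard's internal logic. The column-to-type mapping follows the AccountDescription, AccountType and Operator example; the function classify_change, its return labels and the rule that a historical change takes precedence over a changing one are assumptions of the sketch.

```python
ATTRIBUTE_TYPES = {
    "AccountDescription": "changing",  # type 1: overwrite in place
    "AccountType": "historical",       # type 2: expire and insert
    "Operator": "fixed",               # must not change
}


def classify_change(existing, incoming, fail_on_fixed_change=True):
    """Decide how a source row should be applied to its dimension row."""
    changed = [col for col in ATTRIBUTE_TYPES if existing[col] != incoming[col]]
    kinds = {ATTRIBUTE_TYPES[col] for col in changed}
    if "fixed" in kinds and fail_on_fixed_change:
        fixed_cols = [c for c in changed if ATTRIBUTE_TYPES[c] == "fixed"]
        raise ValueError("fixed attribute changed: " + ", ".join(fixed_cols))
    if "historical" in kinds:
        return "insert_new_version"
    if "changing" in kinds:
        return "update_in_place"
    return "unchanged"


existing = {"AccountDescription": "Cash", "AccountType": "Assets", "Operator": "+"}
print(classify_change(existing, dict(existing, AccountDescription="Cash on hand")))
# update_in_place
print(classify_change(existing, dict(existing, AccountType="Current Assets")))
# insert_new_version
```

Passing fail_on_fixed_change=False makes the routine ignore a change in a fixed attribute instead of raising, which corresponds to leaving the "fail the package" option switched off.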

In Fig 8 (Historical Attribute Options), you can configure the details of your implementation. You can tell SSIS to fail the package when changes in fixed attributes are detected, and you can choose to change all matching records, including expired records, when modifications are detected in a changing attribute. See my brief discussion earlier regarding the first option.

The second option is useful for attributes whose values are repeated across multiple dimension members. For example, the same account type will apply to multiple accounts. If the Account Type is modified from "Liabilities" to "Current Liabilities," you might wish to apply this change to all accounts that have this type.

e) Next, the wizard lets you choose how to identify current and expired records. You have two options: tag the record as expired (or obsolete, outdated or another adjective of your choosing), or identify expired records by adding the date of expiry to them. If you use the latter option, you can use one of the SSIS global variables to determine the date and time value for updating the expired dimension record. I prefer using a flag to identify the current record; all other records are considered expired. However, you can choose either option.



Fig 8. Configuring details of your implementation.

f) The screen in Fig 9 (Data Flow Task) lets you configure support for inferred dimension members. Inferred members are created when you load a record into the fact table and it has no corresponding dimension record. For correct analysis, each fact table record must be associated with a record in a dimension table.

There are multiple ways of handling this in a data warehouse. The SCD wizard lets you create an inferred member record with all dimension attributes set to NULL. It is often more appropriate to have a special "unknown" member in your dimension (perhaps with the dimension surrogate key equal to -999 or another value that stands out) than to create inferred members. In this example we're only concerned with type 1 and type 2 SCD and not with inferred members, so let's uncheck the default selection and click Next.

That is all the information the wizard needs in order to come up with the entire data flow required to maintain the SCD. As you can see from the screenshot, the wizard does quite a bit of work for you:


Fig 9. Use the Data Flow Task to configure support for inferred dimension members.
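The "unknown member" alternative to inferred members mentioned in step f) can be sketched as follows, assuming Python's sqlite3 module as a stand-in for the warehouse database. The tables dim_account and fact_balance, their columns and the helper functions are hypothetical; -999 is the example key suggested above.

```python
import sqlite3

UNKNOWN_KEY = -999  # surrogate key of the special "unknown" member


def lookup_account_key(conn, account_code):
    """Return the dimension key for a business key, or the unknown member."""
    row = conn.execute(
        "SELECT account_key FROM dim_account WHERE account_code = ?",
        (account_code,),
    ).fetchone()
    return row[0] if row else UNKNOWN_KEY


def load_facts(conn, source_facts):
    """Insert fact rows, pointing unmatched business keys at the unknown member."""
    for account_code, amount in source_facts:
        conn.execute(
            "INSERT INTO fact_balance (account_key, amount) VALUES (?, ?)",
            (lookup_account_key(conn, account_code), amount),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_account ("
    " account_key INTEGER PRIMARY KEY, account_code TEXT, account_type TEXT)"
)
conn.execute("CREATE TABLE fact_balance (account_key INTEGER NOT NULL, amount REAL)")
conn.executemany(
    "INSERT INTO dim_account VALUES (?, ?, ?)",
    [(UNKNOWN_KEY, None, "Unknown"), (1, "A100", "Assets")],
)
load_facts(conn, [("A100", 50.0), ("Z999", 7.5)])  # Z999 has no dimension row
fact_keys = [r[0] for r in conn.execute(
    "SELECT account_key FROM fact_balance ORDER BY rowid"
)]
print(fact_keys)  # [1, -999]
```

Every fact row still joins to a dimension row, so totals stay correct, and the facts attached to the unknown member are easy to find and reassign later.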

g) You can customize the data flow the wizard generated to suit your application's needs. But first, let's examine the newly created transformations. The OLE DB Command transformation updates the record_expired attribute if the AccountType value has been altered. This transformation receives input from the Derived Column transformation, which determines the value to insert into the record_expired column. If the value in the data source changes, SSIS creates a new record, so the Union All transformation combines this record with the existing data source records.

The OLE DB Command 1 transformation updates the value of AccountDescription (the changing attribute) based on the business key (AccountCodeAlternateKey) with a command similar to the following:

Finally, the Insert Destination transformation populates the destination table DimAccount.



Fig 10 Inserting data into Dim tables
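To give an idea of the kind of statement the OLE DB Command 1 step issues for the changing attribute, here is a hedged sketch. It is not the statement generated by the wizard; it is a parameterized UPDATE written for this illustration and run through Python's sqlite3 module. DimAccount, AccountDescription, AccountCodeAlternateKey and record_expired are names from the text above; AccountKey, the column types and the sample rows are assumptions.

```python
import sqlite3

# One parameter marker per value, supplied in order.
UPDATE_CHANGING_ATTRIBUTE = (
    "UPDATE DimAccount SET AccountDescription = ? "
    "WHERE AccountCodeAlternateKey = ?"
)


def overwrite_description(conn, account_code, new_description):
    """Type 1 change: overwrite the description on every matching row."""
    cur = conn.execute(UPDATE_CHANGING_ATTRIBUTE, (new_description, account_code))
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DimAccount ("
    " AccountKey INTEGER PRIMARY KEY,"
    " AccountCodeAlternateKey INTEGER,"
    " AccountDescription TEXT,"
    " AccountType TEXT,"
    " record_expired INTEGER)"
)
conn.executemany(
    "INSERT INTO DimAccount VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1000, "Cash", "Assets", 1),          # expired version
        (2, 1000, "Cash", "Current Assets", 0),  # current version
        (3, 2000, "Loans", "Liabilities", 0),
    ],
)
updated = overwrite_description(conn, 1000, "Cash and equivalents")
descriptions = [r[0] for r in conn.execute(
    "SELECT AccountDescription FROM DimAccount ORDER BY AccountKey"
)]
print(updated, descriptions)
# 2 ['Cash and equivalents', 'Cash and equivalents', 'Loans']
```

Because the WHERE clause filters only on the business key, the expired row is overwritten as well, which matches the "change all matching records, including expired records" option. Adding a condition on record_expired would restrict the update to the current row.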

h) Now, let's update a few members in the transactional database and then run the package to see whether the SCD wizard maintains the changing dimensions correctly. Indeed, we can confirm that changing the AccountType attribute value from "assets" to "wonderful assets" triggers the creation of a new record in the account dimension and tags the newly created record as current:

Changes in the AccountDescription column, on the other hand, do not create a new row; they simply overwrite the existing value:

As I mentioned at the beginning of this tip, there are other methods for maintaining slowly changing dimensions in your data warehouse that are not available through the wizard. However, the ability to implement type 1 and type 2 slowly changing dimensions with a few clicks within SSIS can certainly speed up ETL development.


Step 4> Open the Derived Column transformation for the end date and set the expiry date.

Fig 11 Setting up the derived column

Step 5> Modify the update query so that it updates the end date.

Step 6> Modify the Start_date.

Fig 12 Setting up the date field
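Steps 4 to 6 switch the package from a flag to start and end dates. The effect can be sketched as below, assuming Python's sqlite3 module in place of SQL Server. Start_date is named in Step 6; the table dim_customer, the columns customer_id, address and end_date, and the convention that the current row has a NULL end date are assumptions of the sketch.

```python
import sqlite3


def apply_dated_change(conn, customer_id, new_address, load_date):
    """Close the current row with an end date and open a new row from load_date."""
    cur = conn.cursor()
    current = cur.execute(
        "SELECT address FROM dim_customer "
        "WHERE customer_id = ? AND end_date IS NULL",
        (customer_id,),
    ).fetchone()
    if current is not None and current[0] == new_address:
        return False  # no change, nothing to expire
    # Step 5 equivalent: the update query sets the end date of the old row.
    cur.execute(
        "UPDATE dim_customer SET end_date = ? "
        "WHERE customer_id = ? AND end_date IS NULL",
        (load_date, customer_id),
    )
    # Step 6 equivalent: the new row starts on the load date and stays open.
    cur.execute(
        "INSERT INTO dim_customer (customer_id, address, start_date, end_date) "
        "VALUES (?, ?, ?, NULL)",
        (customer_id, new_address, load_date),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer ("
    " customer_key INTEGER PRIMARY KEY AUTOINCREMENT,"
    " customer_id TEXT, address TEXT, start_date TEXT, end_date TEXT)"
)
apply_dated_change(conn, "C1", "Add1", "2009-02-14")
apply_dated_change(conn, "C1", "Add2", "2009-06-24")
history = conn.execute(
    "SELECT address, start_date, end_date FROM dim_customer ORDER BY customer_key"
).fetchall()
print(history)
# [('Add1', '2009-02-14', '2009-06-24'), ('Add2', '2009-06-24', None)]
```

With dates instead of a flag, the dimension can also answer "which address was valid on a given day", at the cost of a slightly more complex lookup for the current row.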



TRUNCATE TABLE STG_BIPCC_RANK

Fig 13 Setting up the derived date

Step 7> Click the OK button; you will now be able to run the package.

Fig 14 Final view of the package

Step 8> Click the Run button, or press F5, to execute the package and see the complete result.

7 Appendix

Some important terms used in the implementation:

• Changing Attribute Updates Output: The record in the lookup table is updated. This output is used for changing attribute rows.

• Fixed Attribute Output: The values in rows that must not change do not match the values in the lookup table. This output is used for fixed attribute rows.

• Business Keys: The Slowly Changing Dimension transformation requires at least one business key column and does not support null business keys. If the data includes rows in which the business key column is null, those rows should be removed from the data flow. You can use the Conditional Split transformation to filter out rows whose business key columns contain null values.

• Inferred Member: Rows for inferred dimension members are inserted. This output is used for inferred member rows.

• Historical Attributes: The lookup table contains at least one matching row. The row marked as "current" must now be marked as "expired". This output is used for historical attribute rows.

• Type 1: Update the columns in the dimension row without preserving any change history.

• Type 2: Preserve the change history in the dimension table and create a new row when there are changes.

• Type 3: Some combination of Type 1 and Type 2, usually maintaining multiple instances of a column in the dimension row, e.g. a current value and one or more previous values.

• Natural Key: The unique source system key that identifies the entity, e.g. CustomerID in the source system would be called nk_CustomerID in the dimension.

• Surrogate Key: An identity value used to uniquely identify the row in the dimension. For a given natural key there will be one row for each Type 2 change, so the natural key will not be unique in the dimension.
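The Business Keys entry above recommends removing rows with a null business key before they reach the Slowly Changing Dimension transformation. The idea behind that Conditional Split can be sketched in plain Python as follows. The function split_on_business_key is hypothetical, nk_CustomerID is borrowed from the Natural Key entry, and treating an empty or blank string like a null is an extra assumption of the sketch.

```python
def split_on_business_key(rows, key="nk_CustomerID"):
    """Separate rows that carry a usable business key from rows that do not."""
    with_key, without_key = [], []
    for row in rows:
        value = row.get(key)
        # Assumption: blank strings are treated the same way as nulls.
        if value is None or str(value).strip() == "":
            without_key.append(row)
        else:
            with_key.append(row)
    return with_key, without_key


source_rows = [
    {"nk_CustomerID": "C1", "Name": "ABC Ltd"},
    {"nk_CustomerID": None, "Name": "No Key Co"},
    {"nk_CustomerID": "  ", "Name": "Blank Key Co"},
]
to_scd, rejected = split_on_business_key(source_rows)
print(len(to_scd), len(rejected))  # 1 2
```

Only the first list would be passed on to the dimension load; the second can be written to an error table for follow-up.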
