
Working with Flat File Source, LookUp & Filter Transformation

This tutorial shows the process of creating an Informatica PowerCenter mapping and workflow that pulls data from a flat file data source and uses LookUp and Filter transformations.

For demonstration purposes, let's consider a flat file with a list of existing and potential customers. We need to create a mapping that loads only the potential customers, not the existing customers, into a relational target table.

While creating the mapping we will cover the following:

Create a mapping which reads from a flat file and creates a relational table consisting of new customers

Analyze a fixed-width flat file

Configure a Connected Lookup transformation

Use a Filter transformation to exclude records from the pipeline

I. Connect to the Repository

1. Connect to the repository.

2. Open the folder where you need the mapping built.

II. Analyze the source files

1. Import the flat file definition  (say Nielsen.dat) into the repository.

2. Select SOURCES | IMPORT FROM FILE from the menu.

3. Select Nielsen.dat from the source file directory path. Hint: Be sure to set Files of type: to All files (*.*) in the pull-down list before clicking OK.

4. Set the following options in the Flat File Wizard:

   1. Select Fixed Width and check the Import field names from first line box. This option extracts the field names from the first record in the file.

   2. Create a break line (separator) between the fields.

   3. Click NEXT to continue.

   4. Refer to Appendix A for the structure of the NIELSEN.DAT flat file.

5. Change the field name St to State and Code to Postal_Code. Note: The physical data file resides on the Server. At run time, when the Server is ready to process the data (now defined by this new source definition called Nielsen.dat), it will look for the flat file that contains the data.

6. Click Finish.

7. Name the new source definition NIELSEN. This is the name that will appear as metadata in the repository for the source definition.

III. Design the Target Schema

Assumption: The target table does not exist in the database

1. Switch to the Target Designer.

2. Select EDIT | CLEAR if necessary to clear the workspace. Any objects you clear from the workspace will still be available in the Designer's Navigator Window, under the Targets node.

3. Drag the NIELSEN source definition from the Navigator Window into the workspace to automatically create a target table definition. You have just created a target definition based on the structure of the source file definition. You now need to edit the target table definition.

4. Rename the table as Tgt_New_Cust_x.


5. Enter the field names as mentioned in the figure below. Change the Key Type for Customer_ID to Primary Key. The Not Null option will automatically be checked. Save the repository.

6. The target table definition should look like the figure below.

7. Create the physical table in the Oracle database so that you can load data. Hint: From the Edit Table properties in the Target Designer, change the database type to Oracle.
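
If you would rather create the table with SQL, here is a minimal DDL sketch. Only Customer_ID, State, and Postal_Code are named in this tutorial; the remaining columns and all datatypes are assumptions standing in for the fields shown in the figure.

CREATE TABLE Tgt_New_Cust_x (
    Customer_ID   NUMBER(10)    NOT NULL,   -- primary key, per step 5
    Company       VARCHAR2(50),             -- assumed attribute column
    City          VARCHAR2(30),             -- assumed attribute column
    State         VARCHAR2(2),              -- renamed from St
    Postal_Code   VARCHAR2(10),             -- renamed from Code
    CONSTRAINT pk_tgt_new_cust_x PRIMARY KEY (Customer_ID)
);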

IV. Create the mapping and drag the Source and Target

1. Create a new mapping with the name M_New_Customer_x.

2. Drag the source into the Mapping Designer workspace. The Source Qualifier should be created automatically.

3. Rename the Source Qualifier as SQ_NIELSEN_x.

4. Drag the target (Tgt_New_Cust_x) into the Mapping Designer workspace.

V. Create a Lookup Transformation

1. Select TRANSFORMATION | CREATE.

2. Select Lookup from the pull-down list.


3. Name the new Lookup transformation Lkp_New_Customer_x.

4. You need to identify the Lookup table in the Lookup transformation. Use the CUSTOMERS table from the source database to serve as the Lookup table and import it from the database.

5. Select Import to import the Lookup table.

6. Enter the ODBC Data Source, Username, Owner name, and Password for the Source Database, and click Connect.

7. In the Select Tables box, expand the owner name until you see a TABLES listing.

8. Select the CUSTOMERS table.

9. Click OK.

10. Click Done to close the Create Transformation dialog box. Note: All the columns from the CUSTOMERS table are seen in the transformation.

11. Create an input-only port in Lkp_New_Customer_x to hold the Customer_Id value coming from SQ_NIELSEN_x.

1. Highlight the Cust_Id column in SQ_NIELSEN_x.

2. Drag and drop it to Lkp_New_Customer_x.

3. Double-click Lkp_New_Customer_x to edit the Lookup transformation.

4. Click the Ports tab and make Cust_Id an input-only port.

5. Make CUSTOMER_ID a lookup and output port.

12. Create the lookup condition.

1. Click the Condition tab.

2. Click the add-new-condition icon.

3. Add the lookup condition: CUSTOMER_ID = Cust_Id. Note: Informatica takes its 'best guess' at the lookup condition you intend, based on the data type and precision of the ports now in the Lookup transformation. A sketch of the query this lookup generates appears after this list.

13. Click the Properties tab.

14. Note the Connection Information attribute (line 6 in the figure below).
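
For reference, with Cust_Id as an input-only port and CUSTOMER_ID as the only lookup/output port, the default query the Integration Service builds against the lookup table is roughly the sketch below; the actual generated SQL can be inspected (and overridden) in the Lookup Sql Override property on this tab.

SELECT CUSTOMERS.CUSTOMER_ID
FROM CUSTOMERS
ORDER BY CUSTOMERS.CUSTOMER_ID

The lookup condition CUSTOMER_ID = Cust_Id is then evaluated against the cached result of this query for each input row.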

VI. Create a Filter Transformation

1. Create a Filter transformation that will pass through those records that do not match the lookup condition, and name it Fil_New_Cust_x.

2. Drag all the ports from the Source Qualifier to the new Filter. The next step is to create an input-only port to hold the result of the lookup.

3. Highlight the CUSTOMER_ID port in Lkp_New_Customer_x.

4. Drag it to an empty port in Fil_New_Cust_x.

5. Double-click Fil_New_Cust_x to edit the filter.

6. Click the Properties tab.

7. Enter the filter condition: ISNULL(CUSTOMER_ID). This condition allows only those records whose looked-up CUSTOMER_ID is NULL (i.e., customers not found in the CUSTOMERS table) to pass through the filter. The SQL sketch after this list shows the equivalent logic.

8. Click OK twice to exit the transformation.

9. Link all ports except CUSTOMER_ID from the Filter to the Target table. 

Hint: Select the LAYOUT | AUTOLINK menu option, or right-click the workspace background and choose Autolink. In the Autolink dialog, select the Name radio button. This links the corresponding columns based on their names.

10. Click OK.

11. Save the repository.

12. Check the Output window to verify that the mapping is valid.

13. Given below is the final mapping.
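
The Lookup-plus-Filter combination built above is logically a left anti-join. As a hedged illustration (pretending for a moment that the NIELSEN source were a relational table), the rows that reach the target are exactly the ones this query would return:

SELECT n.*
FROM NIELSEN n
LEFT OUTER JOIN CUSTOMERS c
  ON c.CUSTOMER_ID = n.Cust_Id
WHERE c.CUSTOMER_ID IS NULL;  -- no match in CUSTOMERS, i.e. a new customer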

VII. Create the Workflow and Set Session Task Properties

1. Launch the Workflow Manager and connect to the repository.

2. Select your folder.

3. Select WORKFLOWS | CREATE to create a Workflow as wf_New_Customer_x.

4. Select TASKS | CREATE to create a Session Task as s_New_Customer_x.

5. Select the M_New_Customer_x mapping.

6. Set the following options in the Session Edit Task:

1. Select the Properties tab. Leave all defaults.

7. Select the Mapping tab.

1. Select the Source folder. On the right hand side, under Properties, verify that the attribute settings are set to the following:

1. Source Directory path = $PMSourceFileDir\

2. File Name = Nielsen.dat (Use the same case as that present on the server)

3. Source Type: Direct 

Note: For the session you are creating, the Server needs the exact path, file name, and extension of the file as it resides on the Server, to use at run time.

2. Click on the Set File Properties button.

3. Click on Advanced.  

4. Check the Line sequential file format check box.

5. Select the Targets folder.

1. Under Connections on the right hand side, select the value for the Target Relational Database Connection.

6. In the Transformations folder, select the Lkp_New_Customer transformation.

1. On the right hand side, under Connections, select the Relational Database Connection for the Lookup table.

8. Run the Workflow.

9. Monitor the Workflow.

10. View the Session Details and Session Log.

11. Verify the results from the target table by running the query: SELECT * FROM Tgt_New_Cust_x;
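
As an additional hedged check (column names as used above), the following count should return 0, confirming that no existing customer leaked into the target:

SELECT COUNT(*) AS existing_leaked
FROM Tgt_New_Cust_x t
JOIN CUSTOMERS c
  ON c.CUSTOMER_ID = t.Customer_ID;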

SCD Type 1 Implementation using Informatica PowerCenter

Unlike SCD Type 2, Slowly Changing Dimension Type 1 does not preserve any history versions of data. This methodology overwrites old data with new data, and therefore stores only the most current information. In this article, let's discuss the step-by-step implementation of SCD Type 1 using Informatica PowerCenter.

The number of records we store in SCD Type 1 does not increase exponentially, as this methodology overwrites old data with new data. Hence we may not need the performance improvement techniques used in the SCD Type 2 tutorial.

Understand the Staging and Dimension Table.

Slowly Changing Dimension Series

Part I : SCD Type 1.

Part II : SCD Type 2.

Part III : SCD Type 3.

Part IV : SCD Type 4.

Part V : SCD Type 6.

For our demonstration purpose, let's consider the CUSTOMER dimension. Below is the detailed structure of both the staging and dimension tables.

Staging Table

In our staging table, we have all the columns required for the dimension table attributes, so no tables other than the dimension table will be involved in the mapping. Below is the structure of our staging table.

Key Points

1. The staging table will have only one day's data. Change Data Capture is not in scope.

2. Data is uniquely identified using CUST_ID.

3. All attributes required by the dimension table are available in the staging table.

Dimension Table

Here is the structure of our Dimension table.


Key Points

1. CUST_KEY is the surrogate key.

2. CUST_ID is the Natural key, hence the unique record identifier.

Mapping Building and Configuration

Step 1

Let's start the mapping building process. For that, pull the CUST_STAGE source definition into the Mapping Designer.

Step 2

Now, using a Lookup transformation, fetch the existing customer columns from the dimension table T_DIM_CUST. This lookup will return NULL values if the customer does not already exist in the dimension table.

LookUp Condition : IN_CUST_ID = CUST_ID

Return Columns : CUST_KEY
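
With CUST_KEY as the return column and the condition above, the lookup's default query is roughly the sketch below (IN_CUST_ID is the input port carrying the staging CUST_ID; the generated SQL can be checked in the Lookup Sql Override property):

SELECT T_DIM_CUST.CUST_KEY,
       T_DIM_CUST.CUST_ID
FROM T_DIM_CUST
ORDER BY T_DIM_CUST.CUST_ID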


Step 3

Use an Expression Transformation to identify the records for insert and update using the expression below.

o INS_UPD : IIF(ISNULL(CUST_KEY),'INS','UPD')

Additionally, create two output ports:

o CREATE_DT : SYSDATE

o UPDATE_DT : SYSDATE

See the structure of the mapping in the image below.

Step 4

Map the columns from the Expression Transformation to a Router Transformation and create two

groups (INSERT, UPDATE) in Router Transformation using the below expression. The mapping will

look like shown in the image.

o INSERT :- IIF(INS_UPD='INS',TRUE,FALSE)

Page 17: Working With Flat File Source

o UPDATE :- IIF(INS_UPD='UPD',TRUE,FALSE)

INSERT Group

Step 5

Every record coming through the 'INSERT' group will be inserted into the dimension table T_DIM_CUST.

Use a Sequence Generator transformation to generate the surrogate key CUST_KEY, as shown in the image below, and map the columns from the Router Transformation to the target as shown.

Note: An Update Strategy is not required if the records are set for insert.


UPDATE Group

Step 6

Records coming from the 'UPDATE' group will update the customer dimension with the latest customer attributes. Add an Update Strategy Transformation before the target instance and set it to DD_UPDATE. Below is the structure of the mapping.

We are done with the mapping building, and below is the structure of the completed mapping.
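
With the session's Treat Source Rows As option set to Data Driven, DD_UPDATE makes the Integration Service issue an UPDATE keyed on the target definition's primary key. A hedged sketch of the statement it generates, with the attribute columns assumed from the staging table and ? standing for the port values:

UPDATE T_DIM_CUST
SET CUST_NAME = ?, ADDRESS1 = ?, ADDRESS2 = ?,
    CITY = ?, STATE = ?, ZIP = ?, UPDATE_DT = ?
WHERE CUST_KEY = ?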

Workflow and Session Creation

There are no specific properties required during the session configuration.

Below is a sample data set taken from the dimension table T_DIM_CUST.

Initial inserted value for CUST_ID 1003

Updated value for CUST_ID 1003

Hope you guys enjoyed this. Please leave us a comment in case you have any questions or difficulties implementing this.

Slowly Changing Dimension Type 2, also known as SCD Type 2, is one of the most commonly used types of dimension table in a Data Warehouse. SCD Type 2 dimension loads are considered complex, mainly because of the data volume we process and because of the number of transformations we use in the mapping. Here in this article, we will be building an Informatica PowerCenter mapping to load an SCD Type 2 dimension.

Understand the Data Warehouse Architecture

Before we go to the mapping design, let's understand the high-level architecture of our Data Warehouse.

Slowly Changing Dimension Series

Part I : SCD Type 1.

Part II : SCD Type 2.


Part III : SCD Type 3.

Part IV : SCD Type 4.

Part V : SCD Type 6.

Here we have a staging schema, which is loaded from different data sources after the required data cleansing. Warehouse tables are loaded directly from the staging schema. Both the staging tables and the warehouse tables are in two different schemas within a single database instance.

Understand the Staging and Dimension Table.

Staging Table

In our staging table, we have all the columns required for the dimension table attributes, so no tables other than the dimension table will be involved in the mapping. Below is the structure of our staging table.

CUST_ID

CUST_NAME

ADDRESS1

ADDRESS2

CITY

STATE

ZIP

Key Points :

1. The staging table will have only one day's data.

2. Data is uniquely identified using CUST_ID.

3. All attributes required by the dimension table are available in the staging table.
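
For reference, a minimal DDL sketch of this staging table; the datatypes are assumptions, only the column names come from the article:

CREATE TABLE CUST_STAGE (
    CUST_ID     NUMBER(10),     -- natural key, assumed numeric
    CUST_NAME   VARCHAR2(100),
    ADDRESS1    VARCHAR2(100),
    ADDRESS2    VARCHAR2(100),
    CITY        VARCHAR2(50),
    STATE       VARCHAR2(2),
    ZIP         VARCHAR2(10)
);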

Dimension Table

Here is the structure of our Dimension table.

CUST_KEY

AS_OF_START_DT

AS_OF_END_DT


CUST_ID

CUST_NAME

ADDRESS1

ADDRESS2

CITY

STATE

ZIP

CHK_SUM_NB

CREATE_DT

UPDATE_DT

Key Points :

1. CUST_KEY is the surrogate key.

2. CUST_ID plus AS_OF_END_DT form the natural key, hence the unique record identifier.

3. Record versions are kept based on a time range using AS_OF_START_DT and AS_OF_END_DT.

4. The active record will have an AS_OF_END_DT value of 12-31-4000.

5. The checksum value of all dimension attribute columns is stored in the column CHK_SUM_NB.
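
And a matching DDL sketch of the dimension table (again with assumed datatypes; CHK_SUM_NB is sized for the 32-character hex string that the Informatica MD5 function returns):

CREATE TABLE T_DIM_CUST (
    CUST_KEY        NUMBER(10)     NOT NULL,  -- surrogate key
    AS_OF_START_DT  DATE,                     -- version start date
    AS_OF_END_DT    DATE,                     -- 12-31-4000 marks the active row
    CUST_ID         NUMBER(10),               -- natural key
    CUST_NAME       VARCHAR2(100),
    ADDRESS1        VARCHAR2(100),
    ADDRESS2        VARCHAR2(100),
    CITY            VARCHAR2(50),
    STATE           VARCHAR2(2),
    ZIP             VARCHAR2(10),
    CHK_SUM_NB      VARCHAR2(32),             -- MD5 checksum of the attributes
    CREATE_DT       DATE,
    UPDATE_DT       DATE,
    CONSTRAINT pk_t_dim_cust PRIMARY KEY (CUST_KEY)
);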

Mapping Building and Configuration

Now that we understand the ETL architecture, the staging table, the dimension table, and the design considerations, we can move on to the mapping development. We are splitting the mapping development into six steps:

1. Join Staging Table and Dimension Table

2. Data Transformation

o Generate Surrogate Key

o Generate Checksum Number

o Other Calculations

3. Identify Insert/Update

4. Insert the New Records

5. Update (Expire) the Old Version

6. Insert the New Version of the Updated Record

1. Join Staging Table and Dimension Table

We are going to OUTER JOIN the Staging (Source) table and the Dimension (Target) table using the SQL override below. An OUTER JOIN gives you all the records from the staging table and the corresponding records from the dimension table; if there is no corresponding record in the dimension table, it returns NULL values for the dimension table columns.

SELECT
-- Columns from the Staging (Source) table
CUST_STAGE.CUST_ID,
CUST_STAGE.CUST_NAME,
CUST_STAGE.ADDRESS1,
CUST_STAGE.ADDRESS2,
CUST_STAGE.CITY,
CUST_STAGE.STATE,
CUST_STAGE.ZIP,
-- Columns from the Dimension (Target) table
T_DIM_CUST.CUST_KEY,
T_DIM_CUST.CHK_SUM_NB
FROM CUST_STAGE
LEFT OUTER JOIN T_DIM_CUST
ON CUST_STAGE.CUST_ID = T_DIM_CUST.CUST_ID  -- join on the natural key
AND T_DIM_CUST.AS_OF_END_DT = TO_DATE('12-31-4000','MM-DD-YYYY')  -- get the active record

2. Data Transformation

Now map the columns from the Source Qualifier to an Expression Transformation. When you map the columns to the Expression Transformation, rename the ports coming from the dimension table to OLD_CUST_KEY and OLD_CHK_SUM_NB, and add the expressions below.

Generate Surrogate Key: A surrogate key will be generated for each and every record inserted into the dimension table.

o CUST_KEY : the surrogate key, generated using a Sequence Generator Transformation.

Generate Checksum Number: the checksum of all dimension attributes. A difference between the checksum of the incoming record and the checksum stored on the dimension table record indicates a changed column value. This is an easier way to identify changes in the columns than comparing each and every column.

o CHK_SUM_NB : MD5(TO_CHAR(CUST_ID) || CUST_NAME || ADDRESS1 || ADDRESS2 || CITY || STATE || TO_CHAR(ZIP))

Other Calculations:

o Effective Start Date: effective start date of the record.

AS_OF_START_DT : TRUNC(SYSDATE)

o Effective End Date: effective end date of the record.

AS_OF_END_DT : TO_DATE('12-31-4000','MM-DD-YYYY')

o Record Creation Date: record creation timestamp; this will be used for the records inserted.

CREATE_DT : TRUNC(SYSDATE)

o Record Update Date: record update timestamp; this will be used for records updated.

UPDATE_DT : TRUNC(SYSDATE)

3. Identify Insert/Update

In this step we will identify the records for INSERT and UPDATE.

INSERT: A record will be set for INSERT if it does not exist in the dimension table. We can identify the new records when OLD_CUST_KEY, the column pulled from the dimension table, is NULL.

UPDATE: A record will be set for UPDATE if it already exists in the dimension table and any of the incoming columns from the staging table has a new value. If OLD_CUST_KEY is not NULL and the checksum of the incoming record is different from the checksum of the existing record (OLD_CHK_SUM_NB <> CHK_SUM_NB), the record will be set for UPDATE.

The following expression will be used in the Expression Transformation port INS_UPD_FLG shown in the previous step:

o INS_UPD_FLG : IIF(ISNULL(OLD_CUST_KEY), 'I', IIF(NOT ISNULL(OLD_CUST_KEY) AND OLD_CHK_SUM_NB <> CHK_SUM_NB, 'U'))

Note that an unchanged record satisfies neither condition, so INS_UPD_FLG evaluates to NULL and the record is dropped by both router groups below.

Now map all the columns from the Expression Transformation to a Router and add two groups as below:

o INSERT : IIF(INS_UPD_FLG = 'I', TRUE, FALSE)

o UPDATE : IIF(INS_UPD_FLG = 'U', TRUE, FALSE)

4. Insert the New Records

Now map all the columns from the 'INSERT' group to the dimension table instance T_DIM_CUST. While mapping the columns, we do not need any of the OLD_ columns, which were pulled from the dimension table.

5. Update (Expire) the Old Version

The records identified for UPDATE will be inserted into a temporary table, T_DIM_CUST_TEMP. These records will then be applied to T_DIM_CUST by a post-session SQL statement. You can learn more about this performance improvement technique from one of our previous posts.

We will map the columns below from the 'UPDATE' group of the Router Transformation to the target table. To update (expire) the old record, we need only the columns in this list:

o OLD_CUST_KEY : to uniquely identify the dimension record.

o UPDATE_DT : audit column recording the record update date.

o AS_OF_END_DT : the record will be expired with the previous day's date.

While we map the columns, AS_OF_END_DT will be calculated as ADD_TO_DATE(TRUNC(SYSDATE),'DD',-1) in an Expression Transformation. The image below gives a picture of the mapping.

6. Insert the New Version of the Updated Record

The records identified as UPDATE will also have to have a new (active) version inserted. Map all the ports from the 'UPDATE' group of the Router Transformation to the target instance T_DIM_CUST. While mapping the columns, we do not need any of the OLD_ columns, which were pulled from the dimension table.

Workflow and Session Creation

During the session configuration process, add the SQL below as the post-session SQL statement. This correlated UPDATE will update the records in the T_DIM_CUST table with the values from T_DIM_CUST_TEMP. As we mentioned previously, this is a performance improvement technique used to update huge tables.

UPDATE T_DIM_CUST
SET (T_DIM_CUST.AS_OF_END_DT,
     T_DIM_CUST.UPDATE_DT) =
    (SELECT T_DIM_CUST_TEMP.AS_OF_END_DT,
            T_DIM_CUST_TEMP.UPDATE_DT
     FROM T_DIM_CUST_TEMP
     WHERE T_DIM_CUST_TEMP.CUST_KEY = T_DIM_CUST.CUST_KEY)
WHERE EXISTS
    (SELECT 1
     FROM T_DIM_CUST_TEMP
     WHERE T_DIM_CUST_TEMP.CUST_KEY = T_DIM_CUST.CUST_KEY);

TARGET UPDATE OVERRIDE - INFORMATICA

When you use an Update Strategy transformation in the mapping, or set the "Treat Source Rows As" session option to Update, the Informatica Integration Service updates a row in the target table whenever it finds a match on the primary key in the target table.

The update strategy works only when there is a primary key defined in the target definition, that is, when you want to update the target table based on the primary key.

What if you want to update the target table by a matching column other than the primary key? In this case the update strategy won't work. Informatica provides a feature, "Target Update Override", to update based on columns that are not the primary key.

You can find the Target Update Override option in the target definition Properties tab. The syntax of the UPDATE statement to be specified in Target Update Override is:

UPDATE TARGET_TABLE_NAME
SET TARGET_COLUMN1 = :TU.TARGET_PORT1,
    [Additional update columns]
WHERE TARGET_COLUMN = :TU.TARGET_PORT
AND [Additional conditions]

Here TU means Target Update, and it is used to refer to the target ports.

Example: Consider the EMPLOYEES table. In the EMPLOYEES table, the primary key is EMPLOYEE_ID. Let's say we want to update the salary of the employees whose employee name is MARK. In this case we have to use the Target Update Override. The update statement to be specified is:

UPDATE EMPLOYEES

SET SALARY = :TU.SAL

WHERE EMPLOYEE_NAME = :TU.EMP_NAME


Target Update Override in Informatica

By default, the Informatica Server updates targets based on key values. However, you can override the default UPDATE statement for each target in a mapping. You might want to update the target based on non-key columns.

For a mapping without an Update Strategy transformation, configure the session to mark source records as update. If your mapping includes an Update Strategy transformation, the Target Update option only affects source records marked as update; the Informatica Server processes all records marked as insert, delete, or reject normally. When you configure the session, mark source records as data-driven. The Target Update Override only affects source rows marked as update by the Update Strategy transformation.

Overriding the WHERE Clause

Default:

UPDATE T_EMP_UPDATE_OVERRIDE SET ENAME = :TU.ENAME, JOB = :TU.JOB, SAL = :TU.SAL WHERE ENAME = :TU.ENAME

You can override the WHERE clause to include non-key columns. For example, you might want to update records for employees named Smith only. To do this, you edit the WHERE clause as follows:

UPDATE T_EMP_UPDATE_OVERRIDE SET EMPNO = :TU.EMPNO, ENAME = :TU.ENAME, JOB = :TU.JOB, SAL = :TU.SAL WHERE ENAME = :TU.ENAME AND ENAME = 'SMITH'

Entering a Target Update Statement

Follow these instructions to create an update statement:

1. Double-click the title bar of a target instance.

2. Click Properties.

3. Click the arrow button in the Update Override field. The SQL Editor displays.

4. Select Generate SQL. The default UPDATE statement appears.

5. Modify the update statement. You can override the WHERE clause to include non-key columns.

6. Click OK.

NOTES:

1. One more thing to note: :TU is a reserved keyword in Informatica, used to match target port names with the target table's column names.

2. The general error when doing this is as follows:

"TE_7023 Transformation Parse Fatal Error; transformation stopped... error constructing sql statement".

Check the following to solve this:

o Specify the override statement only once.

o Keep a space before the :TU.
