DATASTAGE EXPERIMENTS

GETTING STARTED WITH DATASTAGE

Opening Virtual machine:

1) Run the Datastage shortcut.
2) Go to the Action menu in the menu bar and select "Ctrl+Alt+Delete".
3) Enter the login password "P@ssw0rd" and press "OK".
4) Wait about 5 minutes for all the services to load.

NOTE: Avoid moving the mouse cursor frequently and do not open Internet Explorer, as both slow down the services.

To check whether all the services are running:

1) Go to Run.
2) Type "services.msc".
3) Press Enter.
4) Check whether the "IBM WebSphere" service has started.

To clean up temporary files:

1) Run Cleanup.exe.
2) Click the Cleanup button.
3) Wait until all the temporary files are cleared.
4) Close.

Opening the Designer client (InfoSphere DataStage and QualityStage):

1) Run "Designer client.exe".
2) Enter the username and password, then click OK.

Exercise 1: Loading data from an 'oltpsrc' file to a 'dwhtarget' file

Step 1:

File->New->Parallel Job.

Create a project in the repository by right-clicking on dtstage1 and creating a new folder.

Name that folder.

Go to File -> Sequential File on the palette.

Drag and Drop the Sequential File option twice into the work area.

Go to General -> Link on the palette.

Connect the two sequential files using a link in the work area (like drawing an arrow in Paint).

sequential_file(oltp) -> sequential_file(DWH)

This job copies the contents from the OLTP flat file to the DWH flat file.

Step 2:

Create a txt file named “src.txt”.

Type some records with the structure (eno, ename, sal); see the sample after this step.

Rename Sequential_File_0 and Sequential_File_1 as 'oltpsrc' and 'dwhtarget' respectively.
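
For example, src.txt might contain the following (illustrative records, reusing the sample employees that appear later in these experiments):

eno,ename,sal
101,gokul,10000
102,gopal,20000
103,kumar,20000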

Step 3:

Setting oltpsrc properties

Double-click the 'oltpsrc' file in the work area.

Set the properties as follows

File: Location of the source file

First Line is Column Names: set to True if the first line of the src file has column names, else False.

Set Format as follows:

Final delimiter = end (represents end of file)

Delimiter = the delimiter you used in the src file for separating each field

Quote = single | double | none, as per the usage in the src file fields.

Define the column names and datatypes.

Step 4: Setting ‘dwhtarget’ file properties

File=path of target file

File Update Mode = Overwrite (overwrites the target file if it exists) | Create (creates a new file) | Append (appends to the target file)

First Line is Column Names = True (treats the first line of your src file as column names and skips it) | False (loads the first line to the target file)

Step 5: Save Your Project:

Go to File -> Save As.

Item name: Project name

Folder Path: Path of your Project Folder

Step 6: Compiling Project:

Click the compile button on the toolbar.

Step 7: Run the Project:

Click the run button on the toolbar.

Warnings

No limit: runs the job no matter how many warnings occur.

Abort job after: aborts the job after encountering the specified number of warnings.

Note:

Before clicking Run, close your src file and target file.

Link color status during run time:

Black - process not started

Blue - process is running

Red - process aborted

Green - process completed successfully

Step 8: Run Director:

Now go to Tools -> Run Director.

It maintains run logs for all the projects.

To view logs: select the desired project and go to View -> Log.

Exercise 2: Pump the data from source to target with some constraints using the 'FILTER' stage

Filter restricts the rows of a file based on conditions set against one or more fields in each row.

Eg: Select * from emp where sal>10000;

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop three sequential files into the work area.

Step 4: Drag and Drop a Filter from processing option on palette into work area

Step 5: Create a source file named “src.txt”

Step 6: Set sequential_File_0 properties the same as in exercise 1.

Step 7: Set the Filter properties as follows.

Setting Constraints:

Predicates:

1st where-clause condition, for the link DSLink12: (sal <= 10000)

sequential_file_1 will receive the rows that satisfy the above constraint.

2nd where-clause condition, for the link DSLink11: (sal > 10000 and sal <= 20000)

sequential_file_2 will receive the rows that satisfy the above constraint.

Options:

Output Rejects = true for DSLink10; then right-click on DSLink10 and select Convert to Stream.

Keep Output Rejects = false if there is no reject link.

Now sequential_file_3 will receive the rows rejected by the above two constraints.
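
In SQL terms, the three outputs correspond roughly to the following (a sketch, assuming the (eno, ename, sal) structure from step 5):

-- DSLink12 -> sequential_file_1
select eno, ename, sal from src where sal <= 10000;

-- DSLink11 -> sequential_file_2
select eno, ename, sal from src where sal > 10000 and sal <= 20000;

-- reject link DSLink10 -> sequential_file_3: rows matching neither predicate
select eno, ename, sal from src where sal > 20000;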

Output Settings:

Mapping Columns:

1. Select the output link from the combo box.
2. Drag and Drop the columns from the left to the right side.
3. Repeat the above steps for all the output links.

Step 8: Set sequential_file_1, sequential_file_2 and sequential_file_3 properties the same as in exercise 1.

Step 9: Compile.

Step 10: Run the project and observe the output.

Exercise 3: Load the target file from multiple src files using the 'Funnel' stage

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop four sequential files into the work area and rename them as src1, src2, src3 and target respectively.

Step 4: Drag and Drop a Funnel from the processing option on the palette into the work area.

Step 5: Set the src1, src2, src3 properties the same as in exercise 1.

Step 6: Set Funnel Properties as follows

Properties settings

Funnel Type=Continuous Funnel.

The target file is loaded with all the src files in the order in which records arrive at the funnel on their src links.

Funnel Type=Sequence Funnel.

The target file is loaded with all the src files in the order in which the src files are placed in the work area, i.e., from top to bottom.

Funnel Type=Sort Funnel.

The target file is loaded with all the src files in sorted order, based on the sort key value and sort order.
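
As a rough SQL analogy (a sketch; src1, src2 and src3 are assumed to share the (eno, ename, sal) structure):

-- Continuous/Sequence funnel: concatenate the sources
select eno, ename, sal from src1
union all
select eno, ename, sal from src2
union all
select eno, ename, sal from src3;

-- Sort funnel: the same, but ordered by the sort key
select eno, ename, sal from src1
union all
select eno, ename, sal from src2
union all
select eno, ename, sal from src3
order by ename asc;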

Output settings:

Step 7: set target file properties same as in exercise 1.

Step 8: Compile

Step 9: Run the project

Output:

Source files:

Target File on

1. Funnel Type=Continuous Funnel

2. Funnel Type=Sequence Funnel

3. Funnel Type=Sort Funnel with key=ename and sort order=Ascending.

Exercise 4: Pump the target file from the source file in sorted order using the 'SORT' stage

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop two sequential files into the work area.

Step 4: Drag and Drop sort from processing option on the palette into the work area.

Step 5: set sequential_file_0 properties same as in exercise 1.

Step 6: set sort properties as follows

Output setting:

Step 7: set sequential_file_1 properties same as in exercise 1.

Step 8: compile and run the project.
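
In SQL terms, this job behaves roughly like the following (a sketch; the sort key ename and ascending order are assumptions, matching the sort funnel example in exercise 3):

-- Sort stage analogy: emit the source rows in key order
select eno, ename, sal from src order by ename asc;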

OUTPUT:

Source file:

Target File:

Sort can also be performed with the link directed from a Funnel:

The above case won't work, because the Funnel link should be directed directly to the Sort stage.

Exercise 5: Load the target file after removing duplicate rows from the src file using the 'Remove Duplicates' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop two sequential files into the work area.

Step 4: Drag and Drop ‘Remove Duplicates’ from processing option on the palette into the work area.

Step 5: set sequential_file_0 properties same as in exercise 1.

Step 6: set ‘remove duplicates’ properties as follows.

Key=eno (Key column for the operation)

Duplicate to Retain=Last.

Row Duplicates:

Eno, ename, salary

101,gokul,10000

102,gopal,20000

101,gokul,15000

101,gokul,25000

103,kumar,20000

The record (101, gokul) appears three times with different salary values. We need the latest updated row, so we use the 'Remove Duplicates' stage, which removes all duplicate rows while retaining the last (or first) row.

Duplicate row search is made using the key, ‘eno’ in our case.

We can customize the duplicate to be retained by setting Duplicate to Retain=Last | First.
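
A rough SQL equivalent (a sketch using an analytic function; the stage itself decides 'first'/'last' from the input row order, which rownum approximates here):

-- rn = 1 picks the last occurrence of each eno, i.e. Duplicate to Retain = Last
select eno, ename, salary
from (
  select eno, ename, salary,
         row_number() over (partition by eno order by rownum desc) as rn
  from src
)
where rn = 1;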

Output Settings:

Step 7: set sequential_file_1 properties same as in exercise 1.

Step 8: compile and run the project.

OUTPUT FOR THE ABOVE SETTINGS:

SOURCE FILE:

TARGET FILE:

Exercise 6: Join the rows in two src files and load them into the target using the 'JOIN' stage

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop three sequential files into the work area.

Step 4: Drag and Drop ‘Join’ from processing option on the palette into the work area.

Step 5: Set sequential_file_0 and sequential_file_1 properties same as in exercise 1 but select a key in both files with which the join has to be made. In our example we have selected the key as ‘eno’.

Step 6: set join properties as follows.

Key= eno

Join Type= Inner|Left outer|Right outer|Full Outer

Output Settings:

Note:

While joining, keep your small table as the left table and your big table as the right table for better performance.
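
The join types correspond to the standard SQL joins (a sketch, assuming both sources share the key eno):

-- Inner: only eno values present in both sources
select * from src1 inner join src2 using (eno);

-- Left outer: all rows from src1, matched where possible
select * from src1 left outer join src2 using (eno);

-- Right outer: all rows from src2, matched where possible
select * from src1 right outer join src2 using (eno);

-- Full outer: all rows from both sides
select * from src1 full outer join src2 using (eno);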

Step 7: Set sequential_file_2 properties the same as in exercise 1.

Step 8: Compile and Run the project.

OUTPUT:

Source File 1 and 2:

Target file after Inner Join:

Target file after Left outer join:

Target file after Right outer join:

Target file after full outer join:

Exercise 7: Generate n dummy records under a defined table structure using the 'Row Generator' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop a sequential file into the work area.

Step 4: Drag and Drop Row Generator from Development/Debug option on the palette into the work area.

Step 5: Set Row_Generator properties as follows

Output Settings:

Specifying the length and scale values is important here.

Sal = 12000.00 (length = 7 and scale = 2) // generates all the values of a decimal column with the same number of digits.

Page 36: Datastage Experiments

The length value for char is a fixed length (all values of a char column have a fixed number of characters).

The length value for integer and varchar is their upper limit, i.e., the max number of digits for an integer and the max number of characters for a varchar.
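
The column definitions above correspond to a table structure like this (a sketch; the names and sizes are illustrative):

create table emp_gen (
  eno   number(4),      -- integer, max 4 digits
  ename varchar2(10),   -- varchar, max 10 characters
  sal   number(7,2)     -- decimal: length 7, scale 2
);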

Step 6: Set sequential_file_1 properties the same as in exercise 1.

Output:

Target File:

Exercise 8: Load data from a flat src file to a target Oracle database using the 'Oracle Connector' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop a sequential file into the work area.

Step 4: Drag and Drop oracle connector from Database option on the palette into the work area.

Step 5: Set sequential_file_1 properties the same as in exercise 1.

Step 6: Starting Oracle services.

Start the OracleJobSchedulerORCL, OracleOraDb11g_home1TNSListener and OracleServiceORCL services.

Step 7: set oracle_connector properties as follows.

Check the Oracle connectivity by pressing the Test button under Connection.

You can also view the data that has been imported using the View Data button under Usage.
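
The connection can also be verified outside DataStage (a sketch; assumes the scott/tiger@orcl login used in the output section of this exercise, and a target table here assumed to be emp):

sqlplus scott/tiger@orcl

-- then, at the SQL prompt, inspect the target table
select count(*) from emp;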

Output Settings:

Specifying the length and scale values is important here.

Sal = 12000.00 (length = 7 and scale = 2) // generates all the values of a decimal column with the same number of digits.

The length value for char is a fixed length (all values of a char column have a fixed number of characters).

The length value for integer and varchar is their upper limit, i.e., the max number of digits for an integer and the max number of characters for a varchar.

Step 8: Compile and run the project.

Output:

Source File:

Target:

Username: Scott/tiger@orcl

Exercise 9: Load data from an Oracle database to a target flat file using the 'Oracle Connector' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop a sequential file into the work area.

Step 4: Drag and Drop oracle connector from Database option on the palette into the work area.

Step 5: Starting Oracle services.

Start the OracleJobSchedulerORCL, OracleOraDb11g_home1TNSListener and OracleServiceORCL services.

Step 6: Import a table. (This takes a snapshot of the original table; the snapshot is used for further processing with better performance, since reading every record from the Oracle database over an Oracle connection incurs more overhead.)

Since the imported table is only a snapshot, you have to re-import it whenever the table changes.

Any changes you make to the table must be committed before importing it into DataStage, especially in Oracle.
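
For example (a sketch; the table and the change are illustrative):

-- commit the change so the re-imported snapshot can see it
update employee set sal = sal * 1.1 where deptid = 10;
commit;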

Username : scott

Password : tiger

Step 7: Set the oracle_connector properties as follows.

Column Settings:

Load the columns from the ‘employee’ table as follows

a. Click the Load button.
b. Select the table from the 'table definitions' wizard.
c. Select the desired columns from the 'select columns' wizard.

Step 8: Set sequential_file_0 properties the same as in exercise 1.

Step 9: Compile and run the project.

OUTPUT:

Target File:

Exercise 10: Load data from a Teradata database to an Oracle database using the 'Teradata Connector' and 'Oracle Connector' stages.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop Teradata Connector and Oracle Connector from the Database option on the palette into the work area.

Step 4: Start the Teradata services.

Step 5: Import a Teradata database.

Username: tduser

Password: tduser

Step 6: Set Teradata_Connector properties as follows.

Check the Teradata connectivity by pressing the Test button under Connection.

You can also view the data that has been imported using the View Data button under Usage.

Column Settings:

The procedure is the same as in exercise 9.

Specifying the length and scale values is important here (when loading from any DB to a DB, or from a file to any DB).

Sal = 12000.00 (length = 7 and scale = 2) // generates all the values of a decimal column with the same number of digits.

The length value for char is a fixed length (all values of a char column have a fixed number of characters).

The length value for integer and varchar is their upper limit, i.e., the max number of digits for an integer and the max number of characters for a varchar.

Step 7: Set the Oracle Connector properties the same as in exercise 8.

Step 8: Compile and run the project.

OUTPUT:

Target:

Username: Scott/tiger@orcl

Exercise 11: Load data from an Oracle database to a Teradata database using the 'Oracle Connector' and 'Teradata Connector' stages.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop Teradata Connector and Oracle Connector from the Database option on the palette into the work area.

Step 4: Start the Oracle and Teradata services.

Step 5: Import an Oracle table.

Step 6: Set the Oracle_Connector properties the same as in exercise 9.

Step 7: Set Teradata_Connector properties as follows.

Step 8: Compile and run the project.

Output:

At Teradata

Exercise 12: Load data from a Teradata database to a target flat file using the 'Teradata Connector' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop a sequential file into the work area.

Step 4: Drag and Drop a Teradata Connector from the Database option on the palette into the work area.

Step 5: Start the Teradata services.

Step 6: Import a Teradata table.

Step 7: Set the Teradata_Connector properties the same as in exercise 10.

Step 8: Set the Sequential_File properties the same as in exercise 1.

Step 9: Compile and run the project.

OUTPUT:

Source table and Target Flat file.

Exercise 13: Load data from a flat file to a target Teradata database using the 'Teradata Connector' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop a sequential file into the work area.

Step 4: Drag and Drop a Teradata Connector from the Database option on the palette into the work area.

Step 5: Start the Teradata services.

Step 6: Set the Sequential_File properties the same as in exercise 1.

Step 7: Set the Teradata_Connector properties the same as in exercise 10.

Step 8: Compile and run the project.

OUTPUT:

Source flat file and target Teradata table.

Exercise 14: Load data from a Teradata database to another Teradata database using the 'Teradata Connector' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop two Teradata Connectors from the Database option on the palette into the work area.

Step 4: Start the Teradata services.

Step 5: Import a Teradata table.

Step 6: Set teradata_connector_0 properties the same as in exercise 10.

Step 7: Set teradata_connector_1 properties the same as in exercise 11.

Step 8: Compile and Run the project.

OUTPUT:

Source Teradata table 'new_emp' and target Teradata table 'cpy_emp'.

Exercise 15: Load data from an Oracle database to another Oracle database using the 'Oracle Connector' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop two oracle Connectors from the Database option on the palette into the work area.

Step 4: Start the Oracle services.

Step 5: Import an Oracle table.

Step 6: Set oracle_connector_0 properties the same as in exercise 11.

Step 7: Set oracle_connector_1 properties the same as in exercise 10.

Step 8: Compile and Run the project.

OUTPUT:

Source oracle table ‘dept’:

Target Oracle table ‘cpy_dept’:

Exercise 16: Perform some aggregations on the src flat file and load the results into a target flat file using the 'Aggregator' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop two sequential files into the work area.

Step 4: Drag and Drop Aggregator from processing option on the palette into the work area.

Step 5: Set sequential_file_0 properties the same as in exercise 1.

Step 6: Set Aggregator properties as follows.

Select deptid, max(sal) “Max_Sal” from emp group by deptid;

Group = deptid (group by column)

Aggregation Type=Calculation|Count Rows | Re-calculation

Column For Calculation=sal (column on which the aggregation has to be performed)

Maximum Value Output Column=Max_Sal (alias name)

Column Mapping:

Column Settings

By default, the data type for every aggregation output is Double, so reset the type as needed.

Step 7: Set Sequential_File_1 properties the same as in exercise 1.

Step 8: Compile and Run the project.

OUTPUT:

Source File

Target File on ‘Select deptid, max(sal) “Max_Sal” from emp group by deptid;’

Exercise 17: Load from a src flat file to a target flat file with some derived columns using the 'Transformer' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop two sequential files into the work area.

Step 4: Drag and Drop ‘Transformer’ from processing option on the palette into the work area.

Step 5: Set sequential_file_0 properties the same as in exercise 1.

Step 6: Set transformer properties as follows.

Drag and Drop the columns on which derivations have to be performed from left to right (Column Mapping).

On the right-hand side, right-click on each column and select Function -> any desired function; the function prototype will then be loaded into the column.

Edit the column as per the prototype (for example, on selecting UpCase, UpCase(%string%) will be loaded; edit the parameter value to DSLink5.ename).

Derive the 'Grade' column from the sal column using If Else, following the same procedure as above; a SQL sketch of these derivations follows below.

At the bottom right, rename the columns if you want (here we rename 'ename' as 'Emp_Name' and 'sal' as 'Annual_salary'). The changes will be updated in the DSLink6 table.

Be careful when setting the datatype for each derived column.
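
The derivations above are roughly equivalent to the following SQL (a sketch; the Grade thresholds are illustrative, since the actual If Else condition is not reproduced here):

select eno,
       upper(ename) as Emp_Name,          -- UpCase(DSLink5.ename)
       sal as Annual_salary,
       case when sal > 20000 then 'A'     -- illustrative If Else derivation
            else 'B' end as Grade
from src;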

Step 7: Set sequential_file_1 properties the same as in exercise 1.

Step 8: Compile and run the project.

OUTPUT:

Source File:

Target File:

Exercise 18: Compare two tables (DWH and OLTP), capture the changes in the OLTP table with respect to the DWH table, then load the changes to a flat file using the 'Change Capture' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop two Oracle Connectors from the Database option on the palette into the work area.

Step 4: Drag and Drop a ‘sequential file’ from file option on the palette into the work area.

Step 5: Drag and Drop ‘change capture’ from processing option on the palette into the work area.

Step 6: Create two tables, student and dupstudent, with the structure (rollno, name, age, deptid) and insert the same records into both. Then make some changes in the dupstudent table (new inserts, deletes, updates).

Step 7: Set the Oracle Connector properties the same as in exercise 9.

Step 8: set change capture properties as follows.

Setting Properties

Change Key = rollno (a column that never changes, on which the comparison between the tables is made).

Change Value = Age, Deptid, Name (columns whose values change over time).

Drop Output For Copy, delete, edit, insert = False

If the two tables contain an identical record (a copy), don't drop that record; forward it to the flat file.

If a record in student is not present in dupstudent (i.e., it was deleted), that record is also forwarded to the flat file.

Similar actions occur for edit (update) and insert.

Column Settings:

The Change Capture stage generates a column called change_code by default, which indicates the following:

Copy-0

Insert-1

Update-2

Delete-3
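
A rough SQL sketch of what the stage computes (illustrative; uses the student/dupstudent tables from step 6 and the change_code values listed above):

select coalesce(d.rollno, s.rollno) as rollno,
       d.name, d.age, d.deptid,
       case when s.rollno is null then 1                -- insert
            when d.rollno is null then 3                -- delete
            when s.name = d.name and s.age = d.age
                 and s.deptid = d.deptid then 0         -- copy
            else 2                                      -- update
       end as change_code
from student s
full outer join dupstudent d on s.rollno = d.rollno;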

Column Mappings:

Step 9: Set the sequential_file properties the same as in exercise 1.

Step 10: compile and run the project.

OUTPUT:

Source tables:

Target File:

Exercise 19: Look up the existence of records in the DWH table with respect to the OLTP table and join the records using the 'Look Up' stage

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop three sequential files into the work area.

Step 4: Drag and Drop ‘Look Up’ from processing option on the palette into the work area.

Step 5: Set the OLTPSRC and DWHSRC file properties the same as in exercise 1.

NOTE: The oltp file should always be at the top and the dwh file at the bottom of the work area; otherwise an error will occur when running the project.

Step 6: Set the Look Up properties as follows.

Create a link with 'dno' from oltp_link to dwh_link, which acts as the key for comparison.

Drag and Drop the desired columns from oltp_link and dwh_link to target_link.

Step 7: Set the target file properties the same as in exercise 1.

Step 8: Compile and run the project.

OUTPUT:

Source Files (DWH and OLTP):

Result: Execution success

Target File:

Inference:

If the lookup finds all the related records in the DWH table with respect to the OLTP table using a key (here dno), it joins those records; the join type is a 'natural join with using clause'.

So a lookup can act as a join, with the above restriction.
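
In SQL terms (a sketch, using the dno key named above):

-- a successful lookup behaves like a natural join with a using clause
select * from oltp join dwh using (dno);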

Source files (DWH and OLTP):

Result:

Inference:

Since a record with the key (dno = 6) in the oltp table does not exist in the dwh table, an error occurred.

Exercise 20: Maintain a history of changes made in the DWH table with respect to the OLTP table using the 'Slowly Changing Dimension' stage.

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop three oracle connectors from database option on the palette into the work area.

Step 4: Drag and Drop a ‘sequential file’ from file option on the palette into the work area.

Step 5: Drag and Drop ‘Slowly Changing Dimension’ from processing option on the palette into the work area.

Step 6: Create a table oltp with the following description, insert some records, then commit.

Step 7: Create a table deptdwh with the following description.

Step 8: Set the OLTP oracle connector properties the same as in exercise 9 and use the oltp table.

Step 9: Set the DWH oracle connector properties the same as in exercise 9 and use the deptdwh table.

Step 10: Set the Target_DWH oracle connector properties the same as in exercise 9 and use the deptdwh table.

Step 11: Set the Fact sequential file properties the same as in exercise 1.

Step 12: Set the Slowly Changing Dimension properties as follows.

Fast Path: 1 of 5

Select output link as fact (sequential file).

Fast Path: 2 of 5 (Input)

Map the key column between oltp and dwh table.

Fast Path: 3 of 5 (Input)

Set Initial Value as 1

Create a txt file ‘System.txt’ in C:\ for system reference.

Give that file path under Source name:

Fast Path: 4 of 5 (Output)

Map columns for the Fact (sequential file).

Always map common columns from oltp table.

Fast Path: 5 of 5 (Output)

At Initial Stage:

Set Derivation, Purpose and Expire for columns.

Derivation and Expire can be set via double-click -> right-click -> Function -> desired function on the respective columns.

Purpose Settings:

Business Key: primary key

Surrogate key: to locate changes (for system reference)

Type 1: non-changeable values that are not a business key (e.g., date of birth).

Type 2: Changeable values.

Effective Date: Entry date of the record

Expiration Date: entry date of the immediately following duplicate record (so initially set it to null)

Current Indicator: Indicates the active record

Active-1

Inactive-0
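
For a Type 2 change, the stage's behaviour amounts to the following SQL (a sketch; the stdate, expdate and cid column names match the output described below, and surrogate-key handling is omitted):

-- expire the old version of the changed row
update deptdwh set expdate = sysdate, cid = 0
where deptno = 10 and cid = 1;

-- insert the new version as the active record
insert into deptdwh (deptno, dname, stdate, expdate, cid)
values (10, 'JAVA', sysdate, null, 1);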

Fast Path 5 of 5 (output) at final stage:

After setting fast path 5 of 5, fast path 2 of 5 will look as follows.

Step 13: Compile and run the project.

OUTPUT:

The deptdwh table is loaded with the records from the oltp table, with stdate as the current date, expdate as null and cid as 1 (active record).

Fact file content:

After making the following changes to the oltp table:

The deptdwh table receives the changed records as well as the newly inserted oltp records, with stdate as the current date, plus expdate and cid.

The 'dname' value of the row with deptno = 10 is changed from 'C' to 'JAVA'.

The old record gets its expiration date set to the starting date of the newly updated record.

The current indicator (cid) of the old record becomes 0; for the new record, cid = 1.

Fact file content:

Exercise 21: PIVOT STAGE

Step 1: Create a new parallel project.

Step 2: Save the project with a name.

Step 3: Drag and Drop two sequential files into the work area.

Step 4: Drag and Drop ‘pivot’ from processing option on the palette into the work area.

Step 5: Set sequential_file_0 properties the same as in exercise 1.

Step 6: Set Pivot properties as follows.

Input settings:

Output Settings:

Step 7: Set sequential_file_2 properties the same as in exercise 1.

Step 8: Compile and run the project.

OUTPUT:

Source File:

Target File:

NOTE: The datatype of all horizontal columns except the primary key column in the source table must be the same. In our case the q1, q2, q3 columns in the source table are integers, so all of them can fit into the column 'q' with integer datatype in the target table.
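
In SQL terms, the pivot performed here is roughly (a sketch; the key column name sno is an assumption, while q1, q2, q3 and q come from the note above):

-- horizontal columns q1..q3 pivoted into one vertical column q
select sno, q1 as q from src
union all
select sno, q2 from src
union all
select sno, q3 from src;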

Exercise 22: Run the jobs in a sequential manner (one after another) using a 'Sequence Job'

A Sequence Job is mainly used for executing jobs one after another.

It is essential when jobs must execute in a particular sequence, where one job depends on the finished execution state of another job.

For example consider the following query,

Select e.eno,e.ename,e.deptno,d.deptname from emp e join dept d on(e.deptno=d.deptno) where e.deptno in(10,20,30) order by 2;

The above query needs to execute three jobs (1. Join, 2. Filter, 3. Sort) in sequence.

Step 1: Create a new sequence project.

Step 2: Save the project with a name.

Step 3: Drag and Drop the jobs you want to execute sequentially from repository into the work area.

Step 4: Link the Jobs

Step 5: Compile and run the project.

Step 6: Open the Run Director and observe the logs to confirm successful execution of all the jobs.