BI Toolkit - DataStage V1.0


Page 1: BI Toolkit - DataStage V1.0

Business Intelligence (BI) Development Toolkit for DataStage

Duration of course: X hours

Page 2: BI Toolkit - DataStage V1.0

Course Objective

At the completion of this course you should be able to understand:

An overview of the processes followed in a standard development project.

The various phases and the related work products associated with the development process.

The importance of generating the various work products.

Standards, best practices, and tips & tricks specific to the tool.

Insight into different types of projects.

Different types of testing.

Page 3: BI Toolkit - DataStage V1.0

Course Content

Module 1: DataStage Low Level Design

Module 2: DataStage Coding Standards

Module 3: DataStage Best Practices – Tips & Tricks

Module 4: Version Control

Page 4: BI Toolkit - DataStage V1.0

Module 1: DataStage Low Level Design

BI Development Toolkit for DataStage

Page 5: BI Toolkit - DataStage V1.0

Module Objectives

At the completion of this chapter you should be able to:

– Understand the concept of the Low Level Design process.

– Know what a Low Level Design document looks like.

Page 6: BI Toolkit - DataStage V1.0

Low Level Design: Agenda

Key points described in the Low Level Design:

Topic 1: Introduction

Topic 2: Objectives/Purpose

Topic 3: Scope

Topic 4: Core Aspects of Design

Topic 5: Low Level Technical Overview

Topic 6: Low Level Technical Design

Page 7: BI Toolkit - DataStage V1.0

DW/BI Development Process Flow

[Flow diagram] The overall flow runs Solution Outline → Design (Macro & Micro) → Build & Unit Test → Deployment, with QA checkpoints at the onsite/offshore handoffs:

Estimation and Functional Spec Review (onsite): workshops and issue resolution produce the completed Functional Specification and the delivery plan; the estimate loops until it is signed off.

Technical Design (offshore): the technical design is produced and QA-reviewed, then sent for technical design approval; the outputs are the Technical Specification and the Unit Test Plan, supported by offshore knowledge transfer.

Build (offshore): coding and unit testing by the developer, peer review, and sign-off by the team lead; a "coding OK?" check loops back for rework until development is complete.

Acceptance (onsite): the build is sent for onsite acceptance and onsite testing (UAT/System Test/Integration Test), with issues logged as TPR/SCR.

Page 8: BI Toolkit - DataStage V1.0

What is a Low Level Design?

The Low Level Design details all the technical aspects involved in the DataStage ETL process with respect to the following:

Source/Target Names and Locations

This section contains the names of the source/target tables or files, the schema details for tables, or the server details for files.

Source/Target Structures, i.e. table structure or file structure

This describes the field names in a table along with their datatypes, or whether a flat file is delimited or fixed-width.

Source To Target mapping

Explains how data flows from source to target.

Page 9: BI Toolkit - DataStage V1.0

What is a Low Level Design?

QA checks to find any data quality issues.

Jobs/Sequences/Master Sequencer Details

This section shows the names of the Jobs, Sequences, and Master Sequencers, along with the transformation details.

Partitioning Information if any.

Scheduling Information etc.

Page 10: BI Toolkit - DataStage V1.0

Sample Low Level Design

LLD_Template

Page 11: BI Toolkit - DataStage V1.0

Key Points

Step Overview: This shows the key elements, e.g. the inputs, outputs, and key activities involved, along with the artifacts.

Key Activities:

Analysis of the High Level Design; identify the key elements to be included in the Low Level Design.

Understanding of the entire flow from source to target, along with the mapping rules.

Outputs

Technical Specification

Inputs

High Level Design

Roles

Developer

Templates and Sample Artifacts

Sample Artifact

Page 12: BI Toolkit - DataStage V1.0

Module 2: DataStage Coding Standards

BI Development Toolkit for DataStage

Page 13: BI Toolkit - DataStage V1.0

Module Objectives

At the completion of this chapter you should be able to:

– Know the job-level naming conventions used in DataStage.

– Know the parameter naming conventions used in DataStage.

– Know proper documentation standards/commenting within the job.

– Know the proper use of environment/generic parameters as a standard practice.

– Identify the key Coding standard principles.

Page 14: BI Toolkit - DataStage V1.0

DataStage Coding Standards: Agenda

Topic 1: Coding Standards

– DataStage repository structure.

– ETL coding standard guidelines.

Topic 2: Job Naming Conventions

– Stage naming conventions.

– Link naming conventions.

– Container naming conventions.

– Parameter naming conventions.


Page 16: BI Toolkit - DataStage V1.0

What is a Coding standard?

The set of rules or guidelines that tells developers how they must write their code. Instead of each developer coding in their own preferred style, all code is written to the ETL standards, ensuring the consistency of the designed ETL application throughout the project.

Benefits

Reducing development time.

Enabling new members of the team to quickly pick up development.

Allowing for flexibility in exchanging team members between the Data Conversion and the Data Warehouse / Reporting teams.

Providing a template to follow.

Enabling multiple teams/team members to work on multiple phases.

Serving as a basis (after the completion of the pilot project) for the development of jobs for all other countries.

Making use of the GUI, and self-documenting nature of the tool.

Maintainability.


Page 17: BI Toolkit - DataStage V1.0

Repository Structure:

The repository is the central storing place for 'Build' related components. It is a key component of the software when developing jobs in DataStage Designer.

Data Elements – A specification that describes the type of data in a column and how the data is converted. (Server jobs only.)

Jobs – Folder for jobs that are built, compiled and run.

Routines – The BASIC language can be used to write custom routines that can be called upon within server jobs. Routines can be re-used by several server jobs.

Shared Containers – A shared container is a re-useable item stored in the repository and available to any job in the project.

Stage Types – Any stage used in a project – this can be data source, data transformation, or data warehouse.

Table Definitions – A definition describing the data you want, including information about the data table and the columns associated with it. Also referred to as metadata.

Transforms – Similar to routines; these take one value and compute another value from it.


Page 18: BI Toolkit - DataStage V1.0

ETL Coding standard guidelines:

By using a simple repository structure, it is easier to navigate and find the components needed to build a job; if a number of complicated schedules are used, it can also show the flow of jobs.

It is a good idea to set up a folder structure based on a common feature, notably the architectural area.

For each of these groups a Jobs and a Sequences folder is created; thus, for each group two separate folders are created. These groups in turn can be divided into subgroups (and thus subfolders).

Templates are stored in a separate Templates folder directly under the Jobs folder. It is expected that a small number of templates will suffice to create jobs at all levels, so there is no need to create specific folders for templates at every level.

Thoughtful naming of jobs and categories will help the developer in understanding the structure.

If multiple versions of a source system are supported then it is a good idea to reflect the version number in the folder name, so that it is clear which version the corresponding jobs, sequences and templates were written for.


Page 19: BI Toolkit - DataStage V1.0

Job Templates:

Each project should contain job templates in order to ensure that jobs are created with the proper set of job parameters and the correct job parameter names. These job templates are stored in a separate Templates folder directly under the Jobs folder.

Jobs and Sequences

Jobs can be grouped into folders based on a common feature, notably the architectural area they belong to. Thus, for each group a separate folder is created under the Jobs folder. These groups in turn can be divided into subgroups (and thus subfolders).

Table Definitions

The Table Definitions section contains metadata which can be imported from a number of sources, e.g. Oracle tables, or flat files. The folders that this metadata is stored in must represent the physical origin or destination of a table or file. The recommended naming standard (and the default for ODBC) is:

1st subfolder: database type (ODBC, Universe, DSDB2, ORAOCI9)

2nd subfolder: database name.


Page 20: BI Toolkit - DataStage V1.0

Hash Files:

Hash files can be stored either in Universe, or in the file system of the operating system.

Sequential Files:

A DataStage project will potentially use source, target, and intermediate files. These can be placed in separate directories. This will:

Simplify maintenance.

Allow data volumes to be spread evenly across multiple disks.

Allow for closer monitoring of file systems.

Allow for closer monitoring of data flow.

Aid housekeeping processes.


Page 21: BI Toolkit - DataStage V1.0

What is a 'Naming Convention'?

This is an industry-accepted way to name various objects. A variety of factors are considered when assessing the success of a project, and naming standards are an important, but often overlooked, component. An appropriate naming convention:

Establishes consistency in the repository.

Provides a developer-friendly environment.

Benefits: Facilitates smooth migrations and improves readability for anyone reviewing or carrying out maintenance on the repository objects.

It helps in understanding the processes being affected, thereby saving significant time.


Page 22: BI Toolkit - DataStage V1.0

Naming Conventions

The following pages suggest naming conventions for various repository components. Whatever convention is chosen, it is important to make the selection very early in the development cycle and communicate it to the project staff working on the repository. The policy can be enforced by peer review, and at test phases by adding convention checks to both test plans and test execution documents.

Page 23: BI Toolkit - DataStage V1.0

Project Naming Conventions

Project Name – Typically a project contains a set of sequences/jobs/routines/table definitions/etc. This may be a particular release or version and is very much dependent on the project circumstances. The project name cannot contain spaces or punctuation.

A distinction is made according to the project stage: Development, Test, Acceptance, or Production, which is appended to the project name in abbreviated (three-character) form.
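For illustration (the project name and stage abbreviations here are hypothetical, not from the deck), a project named CONVDW would then exist as:

CONVDW_DEV, CONVDW_TST, CONVDW_ACC, CONVDW_PRD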

Page 24: BI Toolkit - DataStage V1.0

Job Naming Conventions

Job – The job names used are very much dependent on the project. Usually job names contain a subject area (the target table) and possibly a job function (load, transform, clear, update, etc.).

Job names have to be unique across all folders.

For projects, the standard chosen is:

<job function>_<target_table>
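For illustration (hypothetical names, assuming 'ld', 'xfm', and 'clr' as job function abbreviations, which the deck leaves unspecified):

ld_CUSTOMER_MASTER, xfm_SALES_ORDER, clr_STG_ITEM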

Page 25: BI Toolkit - DataStage V1.0

Stage Naming Conventions

Passive stages: A passive stage indicates a data component, such as a sequential file, an Oracle table, or an ODBC source (in contrast to active stages, where some kind of processing occurs, such as sorting, transforming, or aggregating).

Generic convention: <data source type>_<data source name>, where the data source type is a two- to four-character (preferably three-character) abbreviation that is as clear and unambiguous as possible.

Sequential File: Seq_<data source name>

Complex Flat File: Cff_<data source name>

Hash File: Hsh_<data source name>
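For illustration, hypothetical instance names following these patterns might be Seq_CustomerFeed, Cff_LegacyOrders, and Hsh_CountryCodes.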

Page 26: BI Toolkit - DataStage V1.0

Stage Naming Conventions

XML file: Xml_<data source name>

Oracle database: Ora_<data source name>

DB2 database: DB2_<data source name>

Page 27: BI Toolkit - DataStage V1.0

Stage Naming Conventions

ODBC source: Odbc_<data source name>

File transferred via FTP: Ftp_<data source name>

Siebel DA: Sbl_<data source name>

Dataset: Ds_<data source name>

Page 28: BI Toolkit - DataStage V1.0

Stage Naming Conventions

Active stages: In active stages some kind of processing occurs, such as sorting, transforming, aggregating, etc.

Generic convention: <stage_type>_<functional_name>

In the case of a transformation, the functional_name typically consists of a verb (indicating the action that is performed) and a noun (the object of the action).

Command: Cmd_<functional_name>

Aggregator: Agg_<functional_name>

Folder: Fld_<functional_name>

Page 29: BI Toolkit - DataStage V1.0

Stage Naming Conventions

Filter: Fltr_<functional_name>

Inter Process: Ipc_<functional_name>

Link Partitioner: Lpr_<functional_name>

Lookup: Lkp_<functional_name>

Page 30: BI Toolkit - DataStage V1.0

Stage Naming Conventions

Merge: Mrg_<functional_name>

Sort: Srt_<functional_name>

Transformer: Xfm_<functional_name>

Page 31: BI Toolkit - DataStage V1.0

Stage Naming Conventions

Change Data Capture: Cdc_<functional_name>

Funnel: Fnl_<functional_name> or Club_<functional_name>

Join: Join_<functional_name>

Page 32: BI Toolkit - DataStage V1.0

Stage Naming Conventions

Surrogate Key Generator: SKey_<functional_name>

Remove Duplicates: Ddup_<functional_name>

Copy: Cpy_<functional_name>

Page 33: BI Toolkit - DataStage V1.0

Link Naming Conventions

Links must have a descriptive name. Unlike stage names, link names start with a lowercase letter. If possible, let the name resemble the preceding stage name, but without the stage type, using the past participle of the verb used in the preceding stage name.

Examples:

enrichedCustomer

sortedOrders

Page 34: BI Toolkit - DataStage V1.0

Container Naming Conventions

Shared Containers

The names of Shared Containers start with Scn_, followed by a meaningful name describing the container's function.

Local Containers

The names of Local Containers start with Lcn_, followed by a meaningful name describing the container's function.
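For illustration (hypothetical names): Scn_ValidateAddress for a shared container that validates addresses, and Lcn_FormatDates for a local container that formats dates.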

Stage Variables:

A stage variable is an intermediate processing variable that retains its value during a read but does not pass its value to a target column.

Stage variable names start with stg_ and reflect their usage.

A standard must be set so that common stage variables are named consistently.

Page 35: BI Toolkit - DataStage V1.0

Parameter Naming Conventions

Parameters

A parameter name should clearly reflect its usage.

General

The general naming convention is: P_<name>

Database parameters:

Data Source Name: P_DB_<logical db name>_DSN

User Identification: P_DB_<logical db name>_USERID

User Authentication Password: P_DB_<logical db name>_PASSWORD
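For illustration, with a hypothetical logical database name of DWH, the three parameters would be P_DB_DWH_DSN, P_DB_DWH_USERID, and P_DB_DWH_PASSWORD.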

Page 36: BI Toolkit - DataStage V1.0

Parameter Naming Conventions

For directory (path) parameters the convention is: P_DIR_<usage>. The following directory parameters have been identified:

Source data for the job: P_DIR_INPUT

Destination directory: P_DIR_OUTPUT

Directory for temporary DS files: P_DIR_TEMP

Directory for error-reporting files: P_DIR_ERRORS

Directory where CSV and other reference data is held: P_DIR_REF
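In a stage property these resolve at run time via the # syntax shown later in this module, e.g. (the file name is illustrative):

File=#P_DIR_INPUT#/customers.csv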

Page 37: BI Toolkit - DataStage V1.0

Datastage Coding Principles and Standards

Suggested Methods of Working:

Before editing a job, verify that the job in development is identical to the one in production. If not, request a copy from the production system.

Create a backup copy of the job you are going to edit, so that you are able to return it to its original state if needed.

After development has finished, clean up any backup copies of jobs you have created, so that there will be no misunderstanding as to which job is the correct one.

Page 38: BI Toolkit - DataStage V1.0

Documentation practices in a job

Incorporating Comments:

One challenge of internal software documentation is ensuring that the comments are maintained and updated in parallel with the source code. Although properly commenting source code serves no purpose at run time, it is invaluable to a developer who must maintain a particularly intricate or cumbersome piece of software.

Job Commenting:

Document all jobs in their Job Properties:

Provide a short description containing a brief, meaningful summary of the job.

Provide a long description containing a history of versions, dates, changes made, and by whom.

Include a reference to the design, including its version.

Document any special file references.

When modifying jobs, always keep the short and long descriptions in the Job Properties up to date.

Page 39: BI Toolkit - DataStage V1.0

Documentation practices in a job

Routines and Functions

Routines and functions are documented in the short and long description fields (as are Jobs), and in the code via comments.

The comments in the short and long description fields (on the General tab) are similar to job comments.

Provide a short description containing a brief, meaningful summary of the routine.

Provide a long description containing a history of versions, dates, changes made, and by whom.

Include a reference to the design, including its version.

Document any special file references.

When modifying routines, always keep the short and long descriptions up to date.

Page 40: BI Toolkit - DataStage V1.0

Suggested Coding Principles

Avoid clutter comments, such as an entire line of asterisks. Instead, use white space to separate comments from code.

Avoid surrounding a block comment with a typographical frame. It may look attractive, but it is difficult to maintain.

Use complete sentences when writing comments. Comments should clarify the code, not add ambiguity.

Comment as you code because you will not likely have time to do it later. Also, should you get a chance to revisit code you have written, that which is obvious today probably will not be obvious six weeks from now.

Comment anything that is not readily obvious in the code.

To prevent recurring problems, always use comments on bug fixes and work-around code, especially in a team environment.

Use comments on code that consists of loops and logic branches. These are key areas that will assist source code readers.

Establish a standard size for an indent, such as three spaces, and use it consistently. Align sections of code using the prescribed indentation.

Page 41: BI Toolkit - DataStage V1.0

Use of parameters

Definition

Job parameters allow you to design flexible, reusable jobs, making a job independent from its source and target environments.

If, for example, we want to process data using a certain userid and password, we can include these settings as part of the job design. However, when we want to use the job again in a different environment, we will most likely have to edit the design and recompile the job.

Instead of entering constants as part of the job design, we can set up parameters that represent processing variables.

Page 42: BI Toolkit - DataStage V1.0

Use of parameters

Creating Project-Specific Environment Variables:

Here are the standard steps to follow:

Step 1 -> Start up DataStage Administrator.

Step 2 -> Choose the project and click the "Properties" button.

Step 3 -> On the General tab, click the "Environment..." button.

Step 4 -> Click on the "User Defined" folder to see the list of job-specific environment variables.

Step 5 -> Type in all the required job parameters that are going to be shared between jobs.

Page 43: BI Toolkit - DataStage V1.0

Use of parameters

Using Environment Variables as Job Parameters:

Step 1 -> Open up a job.

Step 2 -> Go to Job Properties and move to the Parameters tab.

Step 3 -> Click on the "Add Environment Variables..." button (which doesn't add an environment variable, but rather brings an existing environment variable into your job as a job parameter).

Step 4 -> Add these job parameters just like normal parameters to stages in your job, enclosed by the # symbol, for example:

– Database=#$DW_DB_NAME#

– Password=#$DW_DB_PASSWORD#

– File=#$PROJECT_PATH#/#SOURCE_DIR#/Customers_#PROCESS_DATE#.csv

Page 44: BI Toolkit - DataStage V1.0

Use of parameters

Points to Note:

We set the default value of the new parameter to "$PROJDEF" to ensure it is dynamically set each time the job is run.

When the job parameter is first created, it has a default value the same as the Value entered in the Administrator. By changing this value to $PROJDEF you instruct DataStage to retrieve the latest value for this variable at job run time.

Set the value of encrypted job parameters to $PROJDEF as well; it needs to be typed in twice, into the password entry box.

The "View Data" button will not work in server or parallel jobs that use environment variables set to $PROJDEF or $ENV. This is a defect in DataStage. It may be preferable to use environment variables in sequence jobs and pass them to child jobs as normal job parameters, e.g. in a sequence job $DW_DB_PASSWORD is passed to a parallel job as the parameter DW_DB_PASSWORD.


Page 46: BI Toolkit - DataStage V1.0

Application examples

Environment:

Database name, username, password:

Database names and access details can vary between environments or can change over time. By parameterising these at project level, any change can be quickly applied without updating or recompiling all jobs.

File names and locations:

All file names and locations were specific to each run; the filenames themselves were hard-coded, but the file batch and run reference, and the related location, were parameterised.

Page 47: BI Toolkit - DataStage V1.0

Application examples

Process Flow:

Parameters can be entered manually at runtime; however, to avoid data entry errors and speed up turnaround, parameter files were pre-generated and loaded into DataStage with minimal manual input.

Generic Parameters

It is often seen that a number of parameters apply across the whole project. These relate either to the environment or to specific business rules within the mappings. For example:

MIGRATIONDATE – set to the date the extract was taken.

TARGETSYSTEM – set to the name of the test environment due to be loaded with data from this run.

Page 48: BI Toolkit - DataStage V1.0

Module 3: DataStage Best Practices / Tips and Tricks

BI Development Toolkit for DataStage

Page 49: BI Toolkit - DataStage V1.0

Module Objectives

At the completion of this chapter you should be able to:

– Describe DataStage best practices and tips.

– Define DataStage best practices and tips.

– Demonstrate DataStage best practices and tips.

Page 50: BI Toolkit - DataStage V1.0

DataStage Best Practices / Tips and Tricks: Agenda

1. Getting Started
2. Prerequisites
3. Overview of the Data Migration Environment Implemented
4. Estimating a Conversion
5. Preparing the DS Environment (Creating Project Level Parameters)
6. Designing Jobs
6.1 General Design Guidelines
6.2 Ensuring Restartability
6.3 Sample Job Template
6.4 Extracting Data
6.5 Transforming the Extracted Data
6.5.1 Performing Lookups
6.5.2 Lookup Stage Problem
6.5.3 Using Transformer
6.5.4 Transformer Compared to Dedicated Stages
6.5.5 Tips: Sorting
6.5.6 Tips: Removing Duplicates
6.5.7 Null Handling
6.5.8 When to Configure Nodes and Partitioning

Page 51: BI Toolkit - DataStage V1.0

DataStage Best Practices / Tips and Tricks: Agenda

6.6 Capturing Rejects
6.7 Loading Valid Data
6.8 Sequencing the Jobs
6.9 Job Sequences vs Batch Scripts
6.10 Tips: Releasing Locked Jobs
6.11 Mapping Multiple Stand-alone Jobs in One Single Job
6.12 Dataset Management
6.13 Ensuring Restartability
7. Troubleshooting
7.1 Troubleshooting: Some Debugging Techniques
7.2 Oracle Error Codes in DataStage
7.3 Common Errors and Resolution
7.4 Tips: Message Handler
7.5 Local Runtime Message Handling in Director
7.6 Tips: Job Level and Project Level Message Handling
7.7 Using Job Level Message Handler

Page 52: BI Toolkit - DataStage V1.0

DataStage Best Practices / Tips and Tricks: Agenda

6. Unit Testing of the Modules

6.1. General Design Guidelines

6.2. Ensuring Restartability

6.3. Sample Job Template

6.4. Extracting Data

6.5. Transforming the Extracted Data

6.6. Capturing Rejects

6.7. Loading Valid Data

6.8. Sequencing the jobs

6.9. Job sequence vs Batch Scripts

6.10 Tips: Releasing locked Jobs

6.11. Mapping multiple stand-alone jobs in one single job

6.12 Dataset Management


Page 53: BI Toolkit - DataStage V1.0

DataStage Best Practices / Tips and Tricks: Agenda

7. Maintenance Activity
7.1 Backup and Version Control Activity
7.2 Version Control in ClearCase
7.3 DS Auditing Activity
7.4 Retrieving Job Statistics
Assuring Naming Conventions of Components, Jobs and Categories
7.5 Performance Tuning of DS Jobs

8. Preparing UTP – Guidelines

Page 54: BI Toolkit - DataStage V1.0

DataStage Best Practices / Tips and Tricks: Agenda

9. Taking Whole Project Backup

Taking job level export

Taking folder level export

9.1 Backup and Version Control Activities

Version Control in ClearCase

9.2 DS Auditing Activity

Tracking the list of modified jobs during a period

Retrieving Job Statistics

Getting the row counts of different jobs

Page 55: BI Toolkit - DataStage V1.0

DataStage Best Practices / Tips and Tricks: Agenda

9.3 Performance Tuning of DS Jobs

Analyzing a flow

Measuring Performance

Designing for good performance

Improving performance

9.4 Assuring Naming Conventions of components, jobs and categories

9.5 Scheduled Maintenance

Page 56: BI Toolkit - DataStage V1.0

1. Getting Started

For a typical Data Migration Environment, we have defined the roadmap to implement the design using WebSphere DataStage, along with some tips and tricks acquired through experience:

Designing the architecture

Preparing the DS environment

Job Development Phase: Creating the Estimation Model

Job Development Phase: Designing the Job Template

Job Development Phase: Delivering Code Modules

Job Enhancement Phase: Version Control

DataStage Auditing Activity

DataStage Maintenance Activity

Page 57: BI Toolkit - DataStage V1.0

2. Prerequisites

The following documents should be in place before we jump into job development:

1. DataStage Estimation Model

2. DataStage Naming Convention Standards to be followed

3. Job Design Templates

4. Approach towards Backup and Version Control Activity

5. Issue Checklist template

6. Job Review Checklist template

7. Unit Testing Template

Page 58: BI Toolkit - DataStage V1.0

3. Overview of the Data Migration Environment

DataStage requirement: Cleansed data is populated into staging area 0 from the Legacy stage (which holds the cleansed records from the legacy systems).

Client-specific business rules have to be validated primarily during the stage 0 to stage 1 load.

Staging area 2 is the final target of the DataStage load. The remaining validations can be applied here. Staging area 2 records can then be used by other applications for the final load into the target ERP.

Staging area 0: here we have tables for loading master records, transactional records, and configuration data.

Staging area 1: here we have the same tables as in stage 0, but the data model can have small differences. Apart from that, there are tables for storing the error records and the status of each run; we call them CNV_LOG and CNV_RUN respectively. The job repository tables (discussed in the auditing section) are also stored here.

Staging area 2: this is similar to the Oracle ERP tables, which are loaded with stage 1 records.

Page 59: BI Toolkit - DataStage V1.0

4. Estimating a Conversion

An overview of the load job designs needs to be chalked out:

1. The number of lookups to be performed in the load job. The design of the lookup jobs should be explored (scope for a Join stage, or whether the lookup can be performed using custom SQL in the source Oracle stage).

2. The complexity of the transformer in the load job needs to be determined. In the case of multiple lookups or a large number of validations, the complexity should be rated high and the contingency factor in the estimation model can be increased.

3. The existence of mandatory fields (which must be loaded in the target) should be examined. Such records can be rejected at the first opportunity (after the source DB stage) and sent to the log without any further validation. For non-mandatory fields, the records cannot be rejected, and all the validations on the other columns need to be performed.

Page 60: BI Toolkit - DataStage V1.0

5. Preparing a DS Environment

The DataStage installation should be in place, along with the other database installations.

Project level environment variables have to be created to hold the connectivity values of the staging databases and the file locations for input, output, and temporary storage.

Page 61: BI Toolkit - DataStage V1.0

6. Designing Jobs

6.1 General Design Guidelines
6.2 Ensuring Restartability
6.3 Sample Job Template
6.4 Extracting Data
6.5 Transforming the Extracted Data
6.6 Capturing Rejects
6.7 Loading Valid Data
6.8 Sequencing the Jobs
6.9 Job Sequences vs Batch Scripts
6.10 Tips: Releasing Locked Jobs
6.11 Mapping Multiple Stand-alone Jobs in One Single Job
6.12 Dataset Management

Page 62: BI Toolkit - DataStage V1.0

6.1 General Guidelines

Templates have to be created to enhance reusability and enforce coding standards. Jobs should be created from templates.

The template should contain the standard job flow along with proper naming conventions for components, proper job-level annotation, and the short/long descriptions. A change record section should be kept in the long description to keep track of modifications.

Don't copy the job design only; copy using the 'Save As' or 'Create Copy' option at job level.

The DataStage connection should be logged off after completion of work to avoid locked jobs.

Page 63: BI Toolkit - DataStage V1.0

6.2 Ensuring Reusability

Creation of common look-up jobs

Some extraction jobs can be created to create reference datasets. The datasets can then be used in different conversion modules.

Creation of common track jobs

Page 64: BI Toolkit - DataStage V1.0

6.3 Sample Job Template

Below is a sample job: it contains an annotation at the top, and the stages have been named as per the defined standard. Apart from loading valid data into the target table, it populates two flat files with information about the failed records.

Page 65: BI Toolkit - DataStage V1.0

6.4 Extracting Data

1. Use the table method for selecting records from the source. Provide a select list and where clause for better performance.

2. Pull the metadata into the appropriate staging folders in Table Definitions > Oracle. Always use the Orchdb utility to import metadata; it also imports the description part, which helps keep track of the original metadata in case it is modified in the job flow.

3. Avoid passing the table name as a parameter in Oracle stages.

4. For access-restricted application tables, the open command section of the Oracle stage should be used with the relevant query to access the data.

5. Native API stages always perform better than the ODBC stage, so the Oracle stage should be used.

Page 66: BI Toolkit - DataStage V1.0

6.5 Transforming the Extracted Data

6.5.1 Performing Lookups
6.5.2 Lookup Stage Problem
6.5.3 Using Transformer
6.5.4 Transformer Compared to Dedicated Stages
6.5.5 Tips: Sorting
6.5.6 Tips: Removing Duplicates
6.5.7 Null Handling
6.5.8 When to Configure Nodes and Partitioning

Page 67: BI Toolkit - DataStage V1.0

6.5.1 Performing Lookups

Using a Lookup stage:

1. The number of datasets referenced in one Lookup stage should be limited, depending on the reference table data volume.

2. To capture the failed records and store them in a definite format in an error table, the lookup failure and condition-not-met options are set to CONTINUE; hence, the metadata of all the concerned columns in the output of the Lookup stage should be made nullable. The stage performs a left outer join in this case (the source is assumed to be the left link).

Page 68: BI Toolkit - DataStage V1.0

6.5.2 Lookup Stage Problem

While connecting a new Lookup stage into an existing flow, as in the figure, if we detach one of the links, connect it to the new stage, and configure the rest, we will not be able to provide a condition based on the input link columns, as that tab will be disabled.

The reason may be that the earlier link fails to recognize the new stage.

The way out is to remove one of the connecting links and connect two fresh links to the stage.

[Flow diagram: TX → lkp1 → lkp2 → TX]

Page 69: BI Toolkit - DataStage V1.0

6.5.3 Using Transformer

Using parameters in a Transformer:

While passing job parameters to a target column in a Transformer stage, project-defaulted parameters cannot be mapped directly to a target column (a job level parameter will not cause any problem). Possible solutions are:

1. Create a job level parameter and map it to the actual project level parameter at sequence level.

2. Use GetEnvironment(%envvar%), e.g. GetEnvironment('$P_OPCO').

A parameter cannot be used directly inside a stage variable in a Transformer (it will give a compilation error). The alternative strategy is to use a Transformer or Column Generator stage prior to the validation Transformer and insert the parameter value into a dummy field of the output dataset of that first stage. Further calculations can then be carried out using that dummy column.

Page 70: BI Toolkit - DataStage V1.0

6.5.4 Transformer Compared to Dedicated Stages

A PX Transformer is compiled into a separate C++ component and thus slows down performance. It is a kind of all-rounder stage, and dedicated stages are available for many of its tasks:

Transformer constraints can be implemented using a filter stage

For metadata conversion, we have modify stage

For dropping columns or to get multiple outputs, we can use copy stage

Counters can be implemented using a surrogate key stage.

These specialized stages are faster as they do not carry much overhead and should be used when no derivations are present.

But these dedicated stages have problems too. In the Filter and Modify stages no syntax check is provided, so there is no easy way to ensure correct code other than compiling and analyzing the error messages. So, in many cases, using a Transformer enhances the later maintainability of the code and is suggested if performance is not an issue.

Page 71: BI Toolkit - DataStage V1.0

6.5.5 Tips: Sorting

Sort stage:

Using the Sort stage in a multi-node environment:

If more than one logical or physical node is defined, the Sort stage might give unexpected results, since DataStage arbitrarily partitions the incoming dataset, sorts the partitions separately, and writes them to a single dataset. The resolutions are:

1. The safest and easiest way to solve this problem is to run the Sort stage in sequential mode. This can be done by selecting the 'Sequential' option in the Advanced tab of the Stage page.

2. Partition the dataset using hash key partitioning, selecting the hash key to be the same as the sort key. This can be done in the Partitioning tab of the Inputs page of the Sort stage. Collect the data with the sort/merge collection method.

Page 72: BI Toolkit - DataStage V1.0

6.5.6 Tips: Removing Duplicates

Either the Sort stage or the Remove Duplicates stage can be used for this. To remove the duplicates as well as capture the duplicated rows, the Remove Duplicates stage has to be used.

Capturing rows having duplicate key values:

To select distinct values from the input dataset and also catch the duplicates in a separate file, a combination of a Sort stage and a Transformer can be used. In the Properties page of the Sort stage, the Create Key Change option is set to True. This creates an extra column in the result dataset containing '1' for distinct values of the sort key and '0' for the duplicate values. This column can then be used in the Transformer to separate the distinct and duplicate values.
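As a minimal sketch of this pattern (assuming the generated column keeps its default name keyChange, and with illustrative link names not taken from the deck), the Transformer would carry two output link constraints:

Output link 'distinct': inSorted.keyChange = 1

Output link 'duplicates': inSorted.keyChange = 0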

Page 73: BI Toolkit - DataStage V1.0

6.5.7 NULL Handling

Functions such as NullToZero, NullToValue, and NullToEmpty should be used instead of IsNull if the latter causes problems. On a lookup failure, Decimal fields are populated with zero; care should be taken if the source column can also legitimately contain zero, and the validation logic should be framed accordingly.

The approach for mandatory fields differs from that for optional fields. Source records containing NULL in mandatory fields can be rejected at the first opportunity using a Filter stage, whereas records with NULL in optional fields will be loaded into the target.

The approach can be to check for null using the IsNull function, or to check for zero length after trimming the column, and then explicitly set the column to null using the SetNull function.
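A minimal derivation sketch of that check (the link and column names are illustrative, not from the deck):

If IsNull(inLink.CUST_NAME) Or Len(Trim(inLink.CUST_NAME)) = 0 Then SetNull() Else inLink.CUST_NAME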

Page 74: BI Toolkit - DataStage V1.0

6.5.7 NULL Handling While Concatenating Error Messages

Suppose we are generating a key message from more than one field coming from the source. We need to be very careful here: when we concatenate a field into the key message and that field contains a null, the record may get dropped, especially if more fields are concatenated after it. Suppose this is our code to generate a key message, where the field BANK_NUM is a nullable field:

If Len(VarFndBnkNum) <> 0 Then 'Customer ID: ' : validateCustSiteUses.ID : ', BANK_ACCOUNT_NUM: ' : validateCustSiteUses.BANK_ACCOUNT_NUM : ', BANK_NUM: ' : validateCustSiteUses.BANK_NUM : ', ORG_ID: ' : validateCustSiteUses.ORG_ID_LK Else ''

Page 75: BI Toolkit - DataStage V1.0

6.5.7 NULL Handling While Concatenating Error Messages (contd.)

In this case the record containing BANK_NUM = NULL will get dropped. But if we use a NullToEmpty conversion on the field, the code works correctly, as below:

If Len(VarFndBnkNum) <> 0 Then 'Customer ID: ' : validateCustSiteUses.ID : ', BANK_ACCOUNT_NUM: ' : validateCustSiteUses.BANK_ACCOUNT_NUM : ', BANK_NUM: ' : NullToEmpty(validateCustSiteUses.BANK_NUM) : ', ORG_ID: ' : validateCustSiteUses.ORG_ID_LK Else ''

Page 76: BI Toolkit - DataStage V1.0

6.5.8 When to Configure Nodes and Partitioning

In most cases, the task of node configuration and partitioning is left to DataStage (the default, Auto), and it partitions the input dataset based on the number of nodes (two in our case, so two partitions).

Customization is required when a join is performed (presort the data before the join) or when a Sort stage is used (the typical cases found to date).

In some cases a stage may need to be restricted to one node, so that it creates only one process working on the entire dataset, e.g. if we need to count the number of rows with a stage variable derivation such as:

svRowCount = svRowCount + 1

Here, if the stage runs on two nodes, it creates two processes running on the two partitions, so the final count would cover only half of the entire dataset.

The same applies to the logic of vertical pivoting in a Transformer using stage variables.

Page 77: BI Toolkit - DataStage V1.0

6.6 Capturing Rejects

Capturing Rejected Rows:

The records failing validation or getting rejected by the database can be captured in flat files with a definite format (including the field for which the record failed).

Both files can be concatenated and loaded into a database table in a different job. This job can be called after running the load job.

The entries in the log table should refer to the job run entry in the run table.

Page 78: BI Toolkit - DataStage V1.0

6.7 Loading Valid Data

1. Pull the metadata into the proper staging folder in Table Definitions > Oracle.

2. Always use the Orchdb utility to import metadata.

3. Avoid passing the table name as a parameter in Oracle stages.

4. Use the upsert method for the target Oracle stage, with a user-defined query. For insert-only records, make the update SQL always meet a false condition, such as (1 = 2); see the sketch after this list.

5. Journal fields that are not of any business interest can be populated either in DataStage or using Oracle defaults.
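A minimal sketch of such an update statement, assuming a hypothetical target table CNV_CUSTOMER and using the ORCHESTRATE.<column> placeholder notation of Oracle stage user-defined SQL:

UPDATE CNV_CUSTOMER SET CUST_NAME = ORCHESTRATE.CUST_NAME WHERE CUST_ID = ORCHESTRATE.CUST_ID AND 1 = 2

Because 1 = 2 is never true, the update never matches any row, so only the insert branch of the upsert ever fires.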

Page 79: BI Toolkit - DataStage V1.0

6.8 Sequencing the Jobs

Job Activity Stage Best Practices:

Avoid putting $PROJDEF in Job Activity Stage mappings:

Many developers do this because it is a time-saving approach. But if all the project level parameters are mapped as project defaults in the Job Activity stage, the job retrieves the values directly at run time; parameter values will then not flow from the upper level sequence to the individual job, and the user can never override a parameter value during testing.

Provide the execution action as "Reset if required, then run", so that the sequence can reset any aborted subordinate jobs before running them.

The priority of parameter values is top-down, i.e. if a job parameter has been defined in a parallel job with some default value and has been mapped to a sequence level parameter, then the sequence level default value takes precedence at runtime.

Page 80: BI Toolkit - DataStage V1.0

6.8 Sequencing the Jobs

How to avoid manually mapping the same job parameters inside Job Activity stages: a developer shortcut

If a job name is changed, all the parameter mappings get wiped out. So, for a complete conversion development, we would need to map the same parameters for each Job Activity stage manually. To avoid this, the following steps can be followed:

1. Create a sample sequence job and create one Job Activity stage with the complete mapping.

2. Copy and paste the stage as many times as the number of Job Activity stages needed.

Page 81: BI Toolkit - DataStage V1.0

6.8 Sequencing the Jobs

3. Save the job and export it.

4. Now open the .dsx file in Notepad and find the job name.

5. Start from the bottom of the file and replace the job names with the actual job names up to the second-to-last Job Activity stage (the first one already has the proper job name).

6. Save the .dsx and import it into the project. Now copy those stages from that sample job into the actual sequence jobs.

Page 82: BI Toolkit - DataStage V1.0

6.9 Sequences vs Batch Scripts

Sequences have the obvious advantage of a GUI and thus can be developed and maintained very quickly.

Batches were the available functionality before sequences were introduced, so in many applications batches are the way things run. They can be the better choice where custom restartability has to be ensured.

Page 83: BI Toolkit - DataStage V1.0

6.10 Releasing locked jobs

Using DataStage Director:

1. Go to the Director.

2. Go to Job > Cleanup Resources and click Show All in both the processes and the locks windows.

3. Make a note of the PID of the locked job from the bottom window.

4. Select that PID in the processes window and click Logout.

5. Refresh.

6. Check the job from the Designer.

Using a UNIX command:

The kill command can be used to terminate the process holding the lock.
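A minimal sketch of the UNIX route (the process name and PID are illustrative; the exact process to kill depends on the installation):

ps -ef | grep dsapi_slave

kill <PID>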

Page 84: BI Toolkit - DataStage V1.0

6.11 Mapping Multiple Stand-alone Jobs in a Single Job

The flows are executed in parallel.

Advantage: minimised development time compared to the sequence-job approach; useful when a good number of datasets need to be generated for later use as lookups.

Disadvantage: the job will abort if any one of the flows aborts. Also, if the execution time of one flow is higher than the others, they will be kept waiting until all the flows finish.

Page 85: BI Toolkit - DataStage V1.0

6.12 Dataset Management

We usually use datasets as references for performing lookups, or during the debugging phase of a job by placing a dataset on the output link of a stage.

Points to note: the dataset name should have a .ds suffix. The .ds file is the control file, which stores the data file names and the metadata.

During debugging we usually create many temporary datasets. We can remove the unwanted datasets using the dataset management tool in the Director, or using PuTTY directly on the AIX server where the DataStage server is installed.

Best practice tip:

The default location of dataset data files, as set in default.apt (the default configuration file), is the resource disk "C:/Ascential/DataStage/Datasets". It is a preferred best practice to create a custom configuration file for each project, with a separate location provided as the resource disk.
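A minimal sketch of such a custom configuration file (the node name, host name, and disk paths are illustrative):

{
    node "node1"
    {
        fastname "etl_host"
        pools ""
        resource disk "/data/proj1/datasets" {pools ""}
        resource scratchdisk "/data/proj1/scratch" {pools ""}
    }
}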

Page 86: BI Toolkit - DataStage V1.0

6.13 Ensuring Restartability

The easiest way is to enable the "Automatically handle activities that fail" option in the job properties of a sequence job. This allows DataStage to send an abort request to the calling sequence if a subordinate job aborts.

DataStage also provides some job control stages, e.g. the Terminator Activity stage, to further customize the restartability of your job.


7. Troubleshooting

7.1 Troubleshooting: Some debugging Techniques

7.2 Oracle Error Codes in DataStage

7.3 Common Errors and Resolution

7.4 Tips: Message Handler

7.5 Local runtime Message Handling in Director

7.6 Tips: Job Level and Project Level Message Handling

7.7 Using Job Level Message Handler


7.1 Troubleshooting: Debugging Techniques

Using the APT_DUMP_SCORE parameter:

This environment variable is available in the DataStage Administrator under the Parallel > Reporting branch. It configures DataStage to print a report showing the operators, processes, and data sets in a running job.

Using the APT_DISABLE_COMBINATION parameter:

Set the parameter APT_DISABLE_COMBINATION. This environment variable is available in the DataStage Administrator under the Parallel branch. It globally disables operator combining (by default, two or more operators within a step are combined into one process where possible). Note that disabling combining generates more UNIX processes, and hence requires more system resources and memory.


It helps to determine the exact stage where the error is generated, e.g. a record drop due to a null reaching a function without null handling (which would otherwise surface only as a generic AptCombinedOperatorController error).

Using OSH_ECHO: This environment variable is available in the DataStage Administrator under the Parallel > Reporting branch. If set, it causes DataStage to echo its job specification to the job log after the shell has expanded all arguments.


Enable the following environment variables in DataStage Administrator:

APT_PM_PLAYER_TIMING – shows how much CPU time each stage uses.

APT_PM_SHOW_PIDS – shows the process ID of each stage.

APT_RECORD_COUNTS – shows record counts in the log.

APT_CONFIG_FILE – switches the configuration file (one node vs. multiple nodes).

OSH_DUMP – shows the OSH code for your job, and reveals whether any unexpected settings were applied by the GUI.

Use a Copy stage to dump data out to intermediate Peek stages or sequential debug files. Copy stages are removed at compile time, so they do not add overhead.

Use a Row Generator stage to generate sample data.

Look in the phantom files for additional error messages: c:\datastage\project_folder\&PH&
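A hedged sketch for inspecting the newest phantom files on a UNIX server (the &PH& location mirrors the Windows path above; adjust for your install):

  cd /datastage/project_folder/'&PH&'
  ls -t | head -5                   # newest phantom files first
  tail -50 "$(ls -t | head -1)"     # read the end of the most recent one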


7.2 Oracle Error Codes in DataStage

Some common error codes have been listed for ready reference, along with possible remedies to resolve the issues faster.

[Embedded reference sheet: Oracle error codes in DataStage; not reproduced in this transcript.]


7.3 Common errors and resolution

1) AptCombinedOperatorController: NULL found in input dataset. Record dropped.

RESOLUTION: This is generated when a function inside a Transformer receives a null value without null handling being performed (e.g. concatenating a string with a nullable field). The error also occurs when a nullable column is written to a sequential file without null-handling properties set.
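A hedged illustration of a null-safe Transformer derivation (link and column names are hypothetical): If IsNull(lnk_src.COL1) Then "" Else lnk_src.COL1 — or, equivalently, the built-in NullToEmpty(lnk_src.COL1).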

2) ORCHESTRATE step execution terminating due to SIGINT

RESOLUTION: SIGINT is the signal sent (here by the UNIX OS) when a user or the system wishes to interrupt a process. In DataStage it typically corresponds to the warning limit being reached, and is most likely due to a shortfall in available resources. The following techniques have worked, on a trial-and-error basis, in a number of situations:

– Increase the warning limit from the Sequence.

– If Varchar(2000) fields are present in the target, decreasing the column size can resolve the problem.


3) When checking operator: Operator of type "APT_LUTCreateOp": will partition despite the preserve-partitioning flag on the data set on input port 0.

RESOLUTION: This tells you that the job will repartition the data even though the code tells it to preserve the partitioning from upstream. Where this is happening, open the stage and set the input link properties to 'Clear partitioning'.

4) When binding input interface field "FIELD1" to field "FIELD2": Converting a nullable source to a non-nullable result; a fatal runtime error could occur; use a modify operator to specify the value to which the null should be converted.

RESOLUTION: As the failure condition is set to CONTINUE, the metadata of all the concerned columns in the output of the Lookup stage should be made NULLABLE.


7.4 Message Handler

Local Message Handler:

To suppress unwanted warnings, the following method can be followed:

Right-click the warning you want to handle > click Add to Message Handler > click Add Rule. In the next run, the messages will be handled and a consolidated message will be shown.

While taking exports, the executables must also be promoted for these handlers to take effect.

Local runtime message handlers (Local.msh) are stored in the RC_SCnnnn folder under the specific project folder (the path can be found in the Project Pathname in Administrator), where nnnn is the job number generated from DS_JOBS.


7.5 Local Runtime Message Handling in Director

[Four screenshot slides stepping through the Director runtime message-handling dialogs; images not reproduced in this transcript.]


7.6 Tips: Job Level and Project Level Message Handling

Job Level Message Handler:

Allows a job-source-only promotion of code, allows messages to be handled exclusively for a single job, and puts the message handling in a central location.

There is a folder named MsgHandler in the DataStage directory. When a new message handler is saved, a new .msh file is created there.

To take a project from the DEV server to another environment, these message handlers cannot be exported directly along with the .dsx file; instead, the relevant .msh files need to be copied into the same MsgHandler folder on the target. The exported job will then compile and the message handler will work as expected.
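A minimal sketch of that copy step, using the folder name above with hypothetical install paths and handler file name (verify the actual directory on your servers):

  scp /opt/Ascential/DataStage/MsgHandler/job_abc_handler.msh \
      test_server:/opt/Ascential/DataStage/MsgHandler/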

Project Level Message Handler:

Can be defined from Administrator, and applies to all the jobs in that project.

APT_ERROR_CONFIGURATION is a parameter that can be configured to customise the error log.


7.7 Using Job Level Message Handler

[Three screenshot slides demonstrating the job-level message handler dialogs; images not reproduced in this transcript.]


8. Preparing UTP - Guidelines

One standard template should be followed for data artifacts.

Only one consolidated UTP should be kept in Ascendant. In case of enhancements, the addendum UTP should be added as a new section above the Open and Closed Issues section.

Test artifacts should be attached in two spreadsheets for each sequence job. The first should comprise all lookup reference datasets. The second should comprise source, target, cnv_run, cnv_log and one analysis tab.

The main sequence log can be attached as a .bmp file in the Appendix.


9. Maintenance Activities

9.1 Backup and version control Activity

Taking whole project backup

Taking Job level Export

Taking folder level Export

Version Control in ClearCase

9.2 DS Auditing Activity

Tracking the list of modified jobs during a period

Retrieving Job Statistics

Getting the row counts of different jobs

9.3 Performance Tuning of DS Jobs

Analysing a flow

Measuring Performance

Designing for good performance

Improving performance

9.4 Assuring Naming Conventions of components, jobs and categories

9.5 Scheduled Maintenance


9.1 Backup and Recovery Activity

Introduction to the process:

– During the fresh development phase, each newly built module is backed up after being delivered.

– During the test phase, the jobs enhanced each week are identified at the weekend and backed up as part of the version control activity.

– During the dev phase, a whole-project backup can be performed weekly or fortnightly. During the test phase, a whole-project backup is performed monthly.

Features of the tool:

– Taking a whole-project backup from the command line automatically (a command-line sketch follows this list).

– Taking job-level and category-level exports from the command line automatically.

– Identifying the jobs changed during a specified period and taking a backup of those jobs as part of the version control activity.
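A hedged sketch of the whole-project export from the client command line, using the dscmdexport tool that ships with the DataStage client (host, credentials, project and path are placeholders):

  dscmdexport /H=ds_server /U=ds_user /P=secret PROJECT1 C:\backup\PROJECT1_full.dsx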


Back up activity

Taking Job level Export: A Job Repository table has been created in Stage 1. A sequence job runs to refresh this repository. This sequence calls a routine which extracts the job names and the associated category paths into a sequential file. A subsequent load job loads the data into the repository.

If specific categories/jobs have to be exported, the relevant SQL file has to be modified with the required query in the where clause to select the jobs to be exported.

If the requirement is version control, the repository of modified jobs has to be refreshed, and then the main batch can be run directly to perform the export. It will create job-level .dsx files, and one report file will be generated.

If a job is locked by any user, the utility will not proceed further unless the user chooses the skip/abort option, so it is better to restart the server before the export is started. The job-level .dsx files are created with the same folder structure as on the server.
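A minimal sketch of the job-level pass, assuming the refreshed repository has been spooled to a list file and that the dsexport client supports the /JOB switch on your release (verify against the client documentation; all names are placeholders):

  while read job; do
    dsexport.exe /H=ds_server /U=ds_user /P=secret /JOB=$job PROJECT1 "C:\backup\jobs\$job.dsx"
  done < modified_jobs.txt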


Back up activity

Taking folder level Export:

Once the job-level backup is complete, those files can be concatenated to create folder-level .dsx files.

If specific categories have to be exported, the relevant SQL file has to be modified with the required query in the where clause to select the jobs to be exported.

If the requirement is version control, the repository of modified jobs has to be refreshed, and then the main batch can be run directly to perform the export. It will concatenate the job-level .dsx files created earlier to create folder-wise .dsx files.

If a lock file exists, the batch will abort. Unlock the job on the server and run the export batch again to take the export of that job. If the export program was successful, folder-level .dsx files will be generated along with a report file.
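A hedged one-liner for the concatenation step, assuming the job-level files were produced by the toolkit's batch in a form that tolerates concatenation (a plain cat of independently exported .dsx files may duplicate headers; paths are illustrative):

  cat /backup/PROJECT1/CategoryA/*.dsx > /backup/PROJECT1/CategoryA.dsx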


Version Control

– To upload the .dsx into the respective folder in ClearCase, connect to the ClearCase web client and go to the proper path.

– Create the activity indicating the reason for the change (defect number).

– Check out the respective folder (folder > Basic > Check Out).

– Put the .dsx file into the CCRC path on your local machine.

– Check in the folder and click Tools > Update Resources with the selected activity.

– Add the .dsx file to source control (right-click the file in the right-hand pane > Basic > Add to Source Control); a blue background will appear.

– Uncheck the option for checking out after adding to source control.

– To view the history, right-click the file in the right-hand pane > Tools > Show Version Tree; the version tree will be displayed.

– To apply any further change to the code, import the .dsx file to the local machine, make the modifications as required, then compile and run the job and upload the new .dsx as described above.


9.2 DS Auditing Activities

– Tracking the list of modified jobs during a period

– Assuring naming conventions of components, jobs and categories

– Retrieving job statistics


Assuring naming conventions of components and jobs

– A PL/SQL procedure can be used to check the naming conventions of jobs, stages, links and categories. It can generate a report of components that do not match the specified convention.

– If MetaStage can be used to export the DataStage system tables to an RDBMS (e.g. Oracle) via a MetaBroker, the procedure can then be run on those tables to validate the standards.


Retrieving Job Statistics

Retrieving job statistics is a very important aspect of the auditing activity in a data migration. It is ensured in two phases.

– The first phase retrieves the record counts for the source, the records inserted or updated into the target table, the records that failed business rule validation, and the records rejected by Oracle. This is done using a routine written in DataStage BASIC which retrieves record counts by searching for links with specific keywords. These keywords refer to the links from the source, to the target, or the failure links in the load job. This information is stored in the CNV_RUN table.

– A second approach retrieves the job names for which the number of source records does not match the combined total of inserted and failed records (hence some records have been dropped somewhere in the flow).
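A hedged sketch of the reconciliation query behind the second approach; the table name comes from the text above, but the column names are assumptions:

  SELECT job_name
  FROM   cnv_run
  WHERE  src_count <> tgt_count + fail_count + rej_count;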


9.3 Performance Tuning of DS Jobs

– Analysing a flow

– Measuring performance

– Designing for good performance

– Improving performance


9.3 Performance Tuning of DS Jobs: Purpose

This section describes the process of analysing a job flow and measuring its performance against project benchmarks, and then suggests steps to improve the performance of the identified jobs. It is important to note that performance tuning is not something to spend too much time on during initial design: unless it is clear that performance will be an issue, the performance may well be adequate without any of these tuning options, saving you the time of implementing them.


Performance Tuning of DS Jobs: Analysing the Flow

1. A score dump of the job helps in understanding the flow. We can produce one by setting the APT_DUMP_SCORE environment variable to true and running the job (APT_DUMP_SCORE can be set in the Administrator client, under the Parallel > Reporting branch). This causes a report to be produced which shows the operators, processes and data sets in the job.

The report includes information about:

– Where and how data is repartitioned.

– Whether DataStage has inserted extra operators in the flow.

– The degree of parallelism each operator runs with, and on which nodes.

– Where data is buffered.


The score dump is particularly useful in showing where DataStage is inserting additional components in the job flow. In particular, DataStage will add partition and sort operators where the logic of the job demands it. Sorts especially can be detrimental to performance, and a score dump can help you detect superfluous operators and amend the job design to remove them.

2. Runtime information: when you set the APT_PM_PLAYER_TIMING environment variable, information is provided for each operator in a job flow and written to the job log when the job is run. It is often useful to see how much CPU each operator (and each partition of each component) is using. If one partition of an operator is using significantly more CPU than the others, the data may be partitioned in an unbalanced way, and repartitioning, or choosing different partitioning keys, might be a useful strategy.

3. Setting the environment variable APT_DISABLE_COMBINATION may be useful in some situations to get finer-grained information about which operators are using up CPU cycles. Be aware, however, that setting this flag will change the performance behaviour of your flow, so this should be done with care.


Performance Tuning of DS Jobs: Measuring Performance

We measure performance in the following ways:

– If the target is a database (e.g. Oracle in our case), replace the database stage with a sequential file and see whether the job takes the same time. This indicates whether the database connection to the target (a remote connection) is slow, or whether the data volume is simply so large that the run time is expected.

– In the transformations section, set all transformations to default values. This helps establish whether the job is running slowly because of the transformations.

– If the source is a database (e.g. Oracle in our case), the query should be run using hints/partitions/indexes. This gives an insight into whether the source query is a bottleneck.


Check for any Aggregator stage in your jobs. This is part of the transformation bottleneck but needs special attention: an Aggregator stage in the middle of a big job slows the entire job, since all the records must pass through the aggregator (it cannot be processed in parallel).

To catch partitioning problems, run your job with a single-node configuration file and compare the output with your multi-node run. You can just look at the file size, or sort the data for a more detailed comparison.
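A minimal sketch of that comparison on flat-file output (file names are hypothetical):

  wc -l out_1node.txt out_4node.txt                   # quick size check
  sort out_1node.txt > a.srt ; sort out_4node.txt > b.srt
  diff a.srt b.srt | head                             # record-level check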


Performance Tuning of DS Jobs: Improving Performance

Basic steps:

– Remove unwanted columns at the first opportunity.

– Reduce the number of rows processed as early as possible. This can be done by moving the Transformer constraint or filter where-clause into the source Oracle stage.

– Replace Transformers with Modify stages where the transformations are simple (see the sketch after this list). Modify, due to internal implementation details, is a particularly efficient operator; any transformation that can be implemented in the Modify stage will be more efficient than the same operation in a Transformer stage. Transformations that touch a single column (for example keep/drop, type conversions, some string manipulations, null handling) should be implemented in a Modify stage rather than a Transformer.


– Consider using the Oracle bulk loader instead of the upsert method wherever applicable.

– Instead of creating multiple stand-alone flows in a single job, creating separate jobs and calling them in parallel using a Sequencer can improve performance.

– If data is going to be read back in, in parallel, it should never be written as a sequential file. A Data Set or File Set stage is a much more appropriate format.
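A hedged illustration of Modify stage specifications replacing simple Transformer work, as mentioned in the basic steps above; the column names are hypothetical, and the exact conversion names should be checked against the Parallel Job Developer's Guide:

  DROP AUDIT_FILLER
  CUST_ID:int32 = int32_from_string(CUST_ID)
  ADDR_LINE1 = handle_null(ADDR_LINE1, '')

The first line drops a column, the second performs a type conversion, and the third substitutes a value for nulls.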


Advanced steps:

– Run jobs that handle small volumes of data on a single node instead of multiple nodes. This avoids spawning multiple processes and partitions when there is no need. It can be done by adding the environment variable $APT_CONFIG_FILE as a job parameter and setting it to a single-node configuration file (see the sketch below).

– When writing intermediate results that will only be shared between parallel jobs, always write to persistent data sets (using Data Set stages). Ensure that the data is partitioned, and that the partitions, and sort order, are retained at every stage. Avoid format conversion or serial I/O.
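A hedged sketch of overriding the configuration at run time, assuming $APT_CONFIG_FILE has been added to the job as an environment-variable parameter and that a one-node .apt file exists at the path shown (project and job names are placeholders):

  dsjob -run -param '$APT_CONFIG_FILE=/config/one_node.apt' PROJECT1 job_small_volume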


9.5 Scheduled Maintenance

– Regular clean-up of log files.

– Periodic clean-up of the &PH& folder. If the time between when a job says it is finishing and when it actually ends increases, this may be a symptom of an overfull &PH& folder. One way to clean it is in DataStage Administrator: select the Projects tab, click your project, press the Command button, enter the command CLEAR.FILE &PH&, and press the Execute button. Another way is to create a job with the command EXECUTE "CLEAR.FILE &PH&" on the job control tab of the job properties window. It may be scheduled to run weekly, but at a point in your production cycle where it will not delete data critical to debugging a problem. &PH& is a project-level folder, so this job should be created and scheduled in each project.

– Cleaning up persistent datasets periodically. Datasets should not be used for long-term storage, so temporary datasets can be cleaned up. A script can be scheduled to automate the process (a sketch follows).
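A hedged sketch of scripting both clean-ups for a scheduler such as cron; the uvsh invocation and all paths vary by install and must be verified before use:

  . $DSHOME/dsenv                                          # set up the engine environment
  cd /data/projects/PROJECT1 && echo 'CLEAR.FILE &PH&' | $DSHOME/bin/uvsh
  # remove temporary datasets older than 7 days (tmp_ naming convention assumed)
  find /data/project1/datasets -name 'tmp_*.ds' -mtime +7 -exec orchadmin rm {} \;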


9.6 Customised Code

Options:

– Create a BASIC routine and use it as a before/after job subroutine, or via a Routine Activity stage.

– Create a C++ routine and use it inside a PX Transformer.

– Create custom operators and use them as a stage. This allows knowledgeable Orchestrate users to specify an Orchestrate operator as a DataStage stage, which is then available for use in DataStage parallel jobs.

Module 4 : Version Control

BI Development Toolkit for Datastage


Module Objectives

At the completion of this chapter you should be able to:

– Manage and track all DataStage component code changes and releases.

– Maintain an audit trail of changes made to DataStage project components, recording a history of when and where changes were made.

– Store different versions of DataStage jobs.

– Run different versions of the same job.

– Revert to a previous version of a job.

– Store all changes in one centralised place.

Page 125: BI tookit -Datastage V1.0

Presentation Title | IBM Internal Use | Document ID | Apr 11, 2023 128

IBM Global Business Services

© Copyright IBM Corporation 2006

Version Control : Agenda

Topic 1 : Versioning Methodology

– Discipline.

– Basic Principle/Approach.

– Different Projects.

Topic 2: Initializing Components

– Version Control Numbering.

– Filtering Components.

Topic 3: Promoting Components

– Component selection for promotion.

– Different Methods.

Topic 4: Best Practices

– Using a Custom Folder in Version Control.

– Starting Version Control from DS Designer.


Versioning Methodology

In a typical enterprise environment there may be many developers working on jobs, all at different stages of their development cycle. Without version control, effective management of these jobs can become very time consuming, and the jobs can be difficult to maintain.

This module gives an overview of the methodology used in Version Control and highlights some of its benefits. It is not intended as a comprehensive guide to version control management theory.

Benefits:

– Version tracking: archiving and versioning (i.e. release-level tracking) of DataStage-related components, which can be retrieved for bug tracking and other purposes.

– Central code repository: all coding changes are contained in one central managed repository, regardless of project or server locations.

– DataStage integration: components are stored within the 'VERSION' project, which can be opened directly in DataStage from Version Control. Alternatively, Version Control can be opened directly from within any DataStage client.

– Team coordination: components are marked as read-only as they are processed through Version Control, ensuring that they cannot be modified in any way after being released.


Discipline:

To gain the maximum benefit from using Version Control we must exercise a disciplined approach. If we build in that discipline from the start, we will quickly realise the benefits as the project grows.

Always ensure that we pass components through Version Control before sending them to their next stage of development. This will make project development far easier to track, especially in complex projects containing a large number of jobs.

Basic Principle/Approach:

Most DataStage job developers adopt a three-stage approach to developing their DataStage jobs, which has become the de facto standard. These stages are:

– The Development stage

– The Test stage

– The Production stage


Basic Principle/Approach: Scenario without Version Control

In this model, jobs are coded in the development environment, sent for test, redeveloped until testing is complete, and then passed to production.

There is no central management system to control the flow between the development, test and production environments.

We need to think of Version Control as a central hub that all DataStage projects pass through.

Adopting a staged approach to project development, projects can pass from one stage into Version Control before being passed to the next stage.


Basic Principle/Approach: Scenario with Version Control

Whilst in Version Control:

– Projects have the appropriate versioning information added.

– This information includes version number, history, and notes.

– Consistency of the code across different environments is maintained.


Different Projects:

The Version Control project: Version Control uses a special DataStage project as a repository to store all projects and their associated components. This project is usually called 'VERSION', although we may create a project with any name. Whatever name we choose for the version project, the principle remains the same: the Version Control repository contains the archive of all components initialized into it, and therefore stores every level of each code release for each component.

Other projects: if we adopt the three-stage approach, we would typically have three other projects:

– Development: where DataStage jobs and associated components are developed.


– Test: where developed jobs and components are tested.

– Production: the final destination, from where the finished jobs are actually run.

These projects can reside on different DataStage servers if required. Once a development cycle is complete, components are initialized from the Development project into the Version Control repository. From there they are promoted to the Test project. When testing is complete (which may include more development-test cycles), components are promoted from the Version Control repository to the Production project.


Initializing Components

Initialization is the process of selecting components from a source project and moving them into Version Control for processing and promotion to a target project.

When initializing components, the source project is the development project.

After they have been initialized and processed in Version Control, components are promoted to a test or production project.

Initializing components gives them a new release version number.


Version Control Numbering:

The full version number of a DataStage component is broken down as follows:

Release Number . Minor Number

where:

– The Release Number is allocated when we initialize components in Version Control. If required, we can specify a release number in the Initialize Options dialog box. By default, Version Control sets this to the highest release number currently used by objects in its repository.

– The Minor Number is allocated automatically by Version Control when we initialize a component. It increments by one each time we initialize a particular component, until we increase the release number.
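For example (numbers illustrative): a component initialized three times under release 2 carries versions 2.1, 2.2 and 2.3; raising the release number to 3 restarts the minor number, so the next initialization produces 3.1.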



Filtering Components:

We can filter a long list of components to show only those we are interested in for promotion.

For example, we may want to select components associated with 'Sales' or 'Accounting'. Rather than search through the entire list, we can filter the list and select the subset for promotion.


To filter components:

1. Click the Filter button in the Display toolbar so that a text entry field appears.

2. In the text entry field, type the text we want to filter by. We can type letters or whole words; separating letters or words with a comma results in an 'OR' operation. For example, typing 'accounting, sales' results in a list showing components that have 'accounting' or 'sales' in their names. Click the arrow next to the Filter button to specify whether the filter is case sensitive.

3. When we are happy with the filter text, click the Filter execute button, press Return, or click in the tree view of the Version Control window.

4. To return to the default view, click the Filter button again.


Promoting Components

We can promote components after they have been initialized into Version Control.

In a typical environment, components are initialized from a development project and promoted to a test or production project.

Component selection for promotion: we can select components for promotion in the following ways:

– By individual selection

– By batch

– By user

– By server

– By project

– By release

– By date


The different ways of selecting components for promotion are as follows:

By individual selection: we can select components for promotion in the tree view from any view mode. Individual component selection is suitable when we are promoting a small number of components. The more usual scenario is to use release/batch selection.

By batch: when we initialize a group of components into Version Control, the selected group is known as a 'batch'. By default, batches are identified by the date and time they were initialized, but we always prefer to specify a name for a batch. Version Control allows us to select components for promotion by initialization batch, promote batch, or named batch. Selecting components by batch automatically highlights all the components of that batch and so selects them for promotion.


By user: we can select components that have been initialized by a particular user. Select the required user from the menu; all the components initialized by that user are selected ready for promotion.

By server: select the required server from the menu. All the components initialized from that server are selected ready for promotion.

By project: we can select components that have been initialized from a particular project. Select the required project from the menu; all the components initialized from that project are selected ready for promotion.

By release: we can select components that belong to a particular release. All the components belonging to that release are selected ready for promotion.

By date: we can select components that were initialized on a particular date. All the components initialized on that date are selected ready for promotion.


Best Practice

Using a Custom Folder in Version Control:

Many development projects which use DataStage for extraction, transformation and loading (ETL) also incorporate other project-related files which are not part of the DataStage repository.

These files may contain DDL scripts or other resource data. Version Control can process these ASCII files in the same way as it processes DataStage components.

If we choose to add custom folders, they are created automatically by Version Control; there is no need to create them manually.

Every time Version Control subsequently connects to a project, either for initialization or for promotion, it checks whether the custom folder exists. If it does not, Version Control creates it.

After Version Control has created a custom folder, it can be populated with the relevant items.

The only requirement for using custom folders in Version Control is that the components must be stored within a folder in the project itself.


Starting Version Control from DS Designer:

We can run Version Control directly from within DataStage Designer, Director or Manager by adding a link to the DataStage client Tools menu. We can also add options that allow Version Control to start without displaying the login dialog.

If we want Version Control to start with the login details already filled in, and without displaying the login dialog, we can enter appropriate command line arguments. These are entered in the Arguments field and have the following syntax:

/H=hostname /U=username /P=password

where:

– hostname is the DataStage server hosting the project

– username is the DataStage username

– password is the DataStage password

For example, with a hostname of 'ds_server', a username of 'vc_user', and a password of 'control', we would type:

/H=ds_server /U=vc_user /P=control

Version Control can now be started from the DataStage Client.


Questions and Answers