DataStage Job Design Approach

Upload: vamsi-karthik

Post on 14-Apr-2018


  • 7/30/2019 Data Stage Job Design Approach


    Job Design Approach


    2002. Infosys Technologies Ltd.

    Agenda

    Introduction

    Framework

    Scheduling Approach

    Restartability

    Reusability

    Templates

    Modularity and Maintainability

    Performance Considerations


    Introduction

    Job design is influenced by the following factors:

    Framework

    Scheduling Approach

    Restartability

    Reusability/Templates

    Modularity and Maintainability

    Performance Considerations

    Metadata Management


    Framework

    Reprocessing

    System Health Tables

    ACR Balancing

    Logs, Errors & Warnings


    Framework

    Reprocessing: records are rejected (errored out) according to the defined business rules and should be reconsidered the next time the job runs.

    Reprocessing is required if the quality of the source data is not good enough.

    Reprocessing influences job design and the framework in several ways:

    Error records need to be retained so they can be corrected, which calls for a landing/work table.

    Jobs need logic to handle duplicate records that share the same natural key.

    The ACR log file should include the count of reprocessed records.

    End users should be able to identify error records and correct them.
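The reprocessing flow above can be sketched in a few lines of Python. This is only an illustration, not how a DataStage job is built: the function name, the `natural_key` field, and the rule that a newly arrived record replaces a retained error record with the same natural key are all assumptions made for the example.

```python
def run_with_reprocessing(new_records, retained_errors, is_valid, key="natural_key"):
    """Combine retained error records with this run's input and re-apply the rules.

    Assumption: when a retained error record and a new record share a natural
    key, the new record wins (the slides only say duplicates must be handled).
    """
    merged = {}
    for rec in retained_errors:      # error records kept in the landing/work table
        merged[rec[key]] = rec
    for rec in new_records:          # this run's input replaces same-key retained records
        merged[rec[key]] = rec

    loaded, errors = [], []
    for rec in merged.values():
        (loaded if is_valid(rec) else errors).append(rec)

    # Counts of the kind an ACR log file could carry, including reprocessed records.
    counts = {
        "read_new": len(new_records),
        "reprocessed": len(retained_errors),
        "loaded": len(loaded),
        "errored": len(errors),
    }
    return loaded, errors, counts
```

The `errors` list returned by one run would be stored and passed back in as `retained_errors` on the next run, after end users have had a chance to correct the records.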


    Framework

    System Health Tables: jobs should provide the information needed to maintain, track, and control data loading.

    System Health Tables hold the start and end time of a job, the number of records read, written, and bypassed, and the start and end of the batch.

    System Health Tables directly or indirectly influence job design and the framework:

    Jobs must generate the necessary files with the necessary information.

    Jobs must capture enough information, such as link counts.

    Reusable and common jobs can be identified.

    Scheduling and sequencing are influenced.
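As a rough sketch of what a system health table records, the Python example below stores one row per job run in an in-memory SQLite table. The table name `job_run_stats`, its columns, and the balance check (records read should equal records written plus records bypassed) are assumptions for illustration; they are not taken from the project tables named in this deck.

```python
import sqlite3

def create_health_table(conn):
    # Hypothetical layout: one row per job run within a batch.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS job_run_stats (
               job_name TEXT, batch_id TEXT,
               start_time TEXT, end_time TEXT,
               records_read INTEGER, records_written INTEGER,
               records_bypassed INTEGER)"""
    )

def record_run(conn, job_name, batch_id, start_time, end_time,
               records_read, records_written, records_bypassed):
    # Called once per job run, for example from a post-job step.
    conn.execute(
        "INSERT INTO job_run_stats VALUES (?, ?, ?, ?, ?, ?, ?)",
        (job_name, batch_id, start_time, end_time,
         records_read, records_written, records_bypassed),
    )
    conn.commit()

def unbalanced_runs(conn):
    # Assumed control rule: read = written + bypassed for every run.
    cur = conn.execute(
        """SELECT job_name, batch_id FROM job_run_stats
           WHERE records_read <> records_written + records_bypassed
           ORDER BY job_name"""
    )
    return cur.fetchall()
```

A job only needs to expose its link counts and timestamps for such a table to be filled, which is why the health tables feed back into job design.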


    Framework

    A few common tables from the CSL/ABI projects:

    DTMT_PRCS: Stores information about business processes.

    DTMT_PGM_CNTL: Stores all control table entries.

    DTMT_PGM_ERR: Stores information about errors that occurred during program execution.

    DTMT_PGM_EXEC_H: Stores the execution history of every program execution.

    DTMT_REC_ERR_LOG (staging table): Staging table for error records to be corrected.

    DTMT_SRC: Contains source file names.

    DTMT_PGM: Contains details about all the programs.


    Framework

    Logs, Errors, and Warnings: DataStage jobs should have provisions to maintain logs, errors, and warnings.

    Logs are required to facilitate debugging and to keep track of job execution.

    Errors and warnings need to be logged so that business rules and data validations can be verified.

    Restartability plays a vital role in loading errors and warnings.

    Reusable/common jobs can be identified.


    Scheduling

    The scheduling approach affects job design.

    Scheduling can be done in two ways:

    Approach 1: Use DataStage Sequencers to sequence the jobs and use Control-M only for scheduling. Sequences should be built with restart points.

    Pros: Sequencing complexity is abstracted inside the Sequencers.

    Pros: Scheduling is simplified, since only the starting point needs to be scheduled.

    Cons: Building Sequencers adds complexity and effort. Sequencing and job designs are tightly coupled.

    Approach 2: Use Control-M for both sequencing and scheduling. Break the required functionality into restartable jobs and let Control-M sequence and schedule them.

    Pros: Job design is simplified, and sequencing and job designs are loosely coupled.

    Pros: Jobs can be split or joined without a major effect on sequencing. There is no additional overhead of maintaining restartable points.

    Cons: The complexity of sequencing is shifted to scheduling.


    Scheduling: Sequencer Approach


    Scheduling: Control-M Approach

    The jobs/scripts in a project are scheduled through Control-M.

    The dependencies between jobs (successor/predecessor), within the same module or across modules, are tracked in an Excel sheet (xls) that is submitted to the Control-M team.

    The job dependencies are set up in Control-M using triggers, so that a job starts only after all its predecessors have completed successfully.

    A trigger can be the successful completion of a job, the presence of a particular file, etc.

    Sample Control-M Excel sheet attached:

    Requester Name: Brian Turbes
    Application: ADW Gift Registry
    Contact Information: 612-304-0476, [email protected]
    Description of Request: New job setup for application ADWGR
    Requested Migration Date: 2/10/2005
    Test / Prod: 04/01/2005

    Columns: Table Name (if table exists); Job Name (if job exists); Action Requested (Add, Change, Delete); Server/Account (Test, Prod); Days Scheduled (M, T, W, Th, F, Sa, Su); Holidays; Dependencies (job names or line number); Time Window for Job Start (optional); Path Name, Script Name, Parameters.

    Table Name      Job Name    Action  Server/Account       Days  Dependencies  Start Window
    START_OF_CYCLE  ADWGR0010T  Add     grmetltest/adwgradm  F     -             2am
    START_OF_CYCLE  ADWGR0020T  Add     grmetltest/adwgradm  F     ADWGR0010T    -
    START_OF_CYCLE  ADWGR0080T  Add     grmetltest/adwgradm  F     ADWGR0020T    -
    LANDING_JOBS    ADWGR1005T  Add     grmetltest/adwgradm  F     ADWGR0080T    -
    LANDING_JOBS    ADWGR1005B  Add     grmetltest/adwgradm  F     ADWGR1005T    -
    LANDING_JOBS    ADWGR1005L  Change  grmetltest/adwgradm  F     ADWGR1005B    -
    LANDING_JOBS    ADWGR1008T  Add     grmetltest/adwgradm  F     ADWGR0080T    -
    LANDING_JOBS    ADWGR1008B  Add     grmetltest/adwgradm  F     ADWGR1008T    -
    LANDING_JOBS    ADWGR1008L  Change  grmetltest/adwgradm  F     ADWGR1008B    -
    LANDING_JOBS    ADWGR1010T  Add     grmetltest/adwgradm  F     ADWGR1008L    -
    LANDING_JOBS    ADWGR1010B  Add     grmetltest/adwgradm  F     ADWGR1010T    -
    LANDING_JOBS    ADWGR1010L  Change  grmetltest/adwgradm  F     ADWGR1010B    -
    LANDING_JOBS    ADWGR1015T  Add     grmetltest/adwgradm  F     ADWGR0080T    -
    LANDING_JOBS    ADWGR1015B  Add     grmetltest/adwgradm  F     ADWGR1015T    -
    LANDING_JOBS    ADWGR1015L  Change  grmetltest/adwgradm  F     ADWGR1015B    -
    LANDING_JOBS    ADWGR1020T  Add     grmetltest/adwgradm  F     ADWGR0080T    -

    Path Name, Script Name, Parameters (by job):

    ADWGR0010T: /opt/scripts/test/adwetlrun.ksh -f ADWGR0010T_parms.dat ADWGR ADWGR0010T tableEtlPrcsGrp ADWGR0010T adwgrcur
    ADWGR0020T: /opt/scripts/test/adwetlrun.ksh -f ADWGR0020T_parms.dat ADWGR ADWGR0020T tableEtlPrcs ADWGR0020T adwgrcur
    ADWGR0080T: /opt/scripts/test/adwetlrun.ksh -f ADWGR0080T_parms.dat ADWGR ADWGR0080T tablePrcsCntl ADWGR0080T adwgrcur
    ADWGR1005T: /opt/scripts/test/adwetlrun.ksh -f ADWGR1005T_parms.dat ADWGR ADWGR1005T tableGftrgE ADWGR1005T
    ADWGR1005B: /opt/scripts/test/adwacrrun.ksh ADWGR1005B ADWGR1005B ADW3407 adwgrcur ADWGR
    ADWGR1005L: /opt/scripts/test/adwetlrun.ksh -f ADWGR1005L_parms.dat ADWGR ADWGR0030T tableEtlSubPrcs.ADWGR1005 ADWGR1005L adwgrcur
    ADWGR1008T: /opt/scripts/test/adwetlrun.ksh -f ADWGR1008T_parms.dat ADWGR ADWGR1008T dss1008GftrgCust ADWGR1008T
    ADWGR1008B: /opt/scripts/test/adwacrrun.ksh ADWGR1008B ADWGR1008B ADW3401 adwgrcur ADWGR
    ADWGR1008L: /opt/scripts/test/adwetlrun.ksh -f ADWGR1008L_parms.dat ADWGR ADWGR0030T tableEtlSubPrcs.ADWGR1008 ADWGR1008L adwgrcur
    ADWGR1010T: /opt/scripts/test/adwetlrun.ksh -f ADWGR1010T_parms.dat ADWGR ADWGR1010T tableGftrgCustE ADWGR1010T adwgrcur
    ADWGR1010B: /opt/scripts/test/adwacrrun.ksh ADWGR1010B ADWGR1010B ADW3409 adwgrcur ADWGR
    ADWGR1010L: /opt/scripts/test/adwetlrun.ksh -f ADWGR1010L_parms.dat ADWGR ADWGR0030T tableEtlSubPrcs.ADWGR1010 ADWGR1010L adwgrcur
    ADWGR1015T: /opt/scripts/test/adwetlrun.ksh -f ADWGR1015T_parms.dat ADWGR ADWGR1015T tableGftrgBabyE ADWGR1015T adwgrcur
    ADWGR1015B: /opt/scripts/test/adwacrrun.ksh ADWGR1015B ADWGR1015B ADW3402 adwgrcur ADWGR
    ADWGR1015L: /opt/scripts/test/adwetlrun.ksh -f ADWGR1015L_parms.dat ADWGR ADWGR0030T tableEtlSubPrcs.ADWGR1015 ADWGR1015L adwgrcur
    ADWGR1020T: /opt/scripts/test/adwetlrun.ksh -f ADWGR1020T_parms.dat ADWGR ADWGR1020T tableGftrgCharE ADWGR1020T
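The predecessor rule described for this approach (a job starts only after all its predecessors have completed successfully) can be illustrated with a small Python sketch. It does not use or imitate any Control-M interface; the function names are hypothetical, and the dependency map is a subset of the Dependencies column of the sample sheet.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Job -> set of predecessor jobs, taken from the sample sheet's Dependencies column.
PREDECESSORS = {
    "ADWGR0010T": set(),
    "ADWGR0020T": {"ADWGR0010T"},
    "ADWGR0080T": {"ADWGR0020T"},
    "ADWGR1005T": {"ADWGR0080T"},
    "ADWGR1005B": {"ADWGR1005T"},
    "ADWGR1005L": {"ADWGR1005B"},
    "ADWGR1008T": {"ADWGR0080T"},
}

def ready_jobs(predecessors, completed_ok):
    """Jobs not yet run whose predecessors have all completed successfully."""
    return sorted(
        job for job, preds in predecessors.items()
        if job not in completed_ok and preds <= completed_ok
    )

def run_order(predecessors):
    """One valid execution order that respects every dependency."""
    return list(TopologicalSorter(predecessors).static_order())
```

Once `ADWGR0080T` has completed, both `ADWGR1005T` and `ADWGR1008T` become ready, which shows how independent landing jobs can be released in parallel by the scheduler.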


    Restartability

    Restartability influences how jobs are broken up in the job design.

    Restartability is very important in ETL jobs, and each job should be restartable.

    Restartability plays a vital role in:

    Loading tables with history

    Sequence number generation

    Reprocessing

    Loading error/warning tables

    Loading System Health Tables

    If Sequencers are used for sequencing, Sequencer routines and shell scripts will be the placeholders that maintain restartable points.

    If Control-M is used for sequencing, breaking up jobs and identifying common jobs is key.
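A restart point can be pictured as a small checkpoint that remembers which steps of a batch have already succeeded. The Python sketch below is an assumption-laden illustration (a JSON checkpoint file and a hypothetical `run_with_restart` function), not the mechanism DataStage Sequencers or Control-M actually use.

```python
import json
from pathlib import Path

def run_with_restart(steps, checkpoint_path):
    """Run (name, action) steps in order, skipping steps a previous run completed.

    Returns the names of the steps executed in this run. If a step raises,
    the checkpoint file keeps the steps that finished before the failure.
    """
    path = Path(checkpoint_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    executed = []
    for name, action in steps:
        if name in done:
            continue                      # already completed in an earlier run
        action()                          # an exception stops the batch here
        done.add(name)
        path.write_text(json.dumps(sorted(done)))  # persist the restart point
        executed.append(name)
    path.unlink(missing_ok=True)          # batch finished; next run starts fresh
    return executed
```

Rerunning after a failure resumes at the failed step, so earlier steps such as history loads or sequence number generation are not repeated.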


    Reusability

    Reusability is very important in software projects.

    DataStage allows reusability in the following forms:

    Shared Containers

    BuildOps

    Common Jobs

    Routines

    Templates

    Shared Containers are the best form of reusability in DataStage. Typical candidates for a Shared Container are:

    Sequence ID generation logic

    Error/warning generation and loading

    Loading landing tables with common functionality

    Common business rules and logic

    A Container is a group of stages and links that performs a particular task. The container wraps the complex logic into one unit and acts as a stage.


    Reusability

    BuildOps provide the flexibility to write your own logic.

    BuildOps can be used to obtain common functionality within or across modules when achieving that functionality with DataStage stages would be complex.

    Code ease: handling complex conditions, such as many nested if-else statements or many stage variables and their computation, is much easier in a BuildOp than in a Transformer stage.

    Coding liberties: a BuildOp allows the use of data structures such as arrays and strings, loop statements such as for and while loops, and many other normal coding paradigms. It also allows the use of various header files and their built-in functions. For example, including string.h provides APT_String, which can be used for string declarations and other string operations. These coding features are otherwise not easy to use in DataStage.


    Reusability

    A Common Job performs common tasks across projects/modules, taking different parameters for different contexts.

    Common Jobs should be enabled for multiple instances so that several instances can run in parallel.

    Routines help perform pre-job and post-job activities such as copying input files to different directories, ACR file generation, log files, etc.

    Clearly dividing activities among shell scripts, DataStage jobs, routines, sequences, and generic shell scripts is key to a clean separation and consistency across the project. This will influence the job designs.

    The job template should contain generic annotations that act as guidelines when creating jobs.

    All parameters that are common across all jobs should be defined in the job templates.

    Specific stage properties that are common or mandatory should be defined in the job templates.

    Templates act as a design pattern/guideline for achieving consistency and strictly enforcing dos and don'ts.

    Identifying common patterns and defining templates will achieve consistency.

    A few reusable components will evolve as the project progresses, but enough up-front analysis should be done to identify reusable components. Piloting a module is another option for bringing out reusable components.
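The idea of one common job run as several parallel instances, each with its own parameters on top of template defaults, can be sketched in Python as follows. Everything here is hypothetical: `TEMPLATE_PARAMS`, `common_load_job`, the parameter names, and the use of threads stand in for DataStage job parameters and invocation IDs only by analogy.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical defaults that a job template would define for every job.
TEMPLATE_PARAMS = {"env": "test", "reject_limit": 0}

def common_load_job(invocation_id, **overrides):
    """One common job; each instance gets the template defaults plus its own parameters."""
    params = {**TEMPLATE_PARAMS, **overrides}
    return f"{invocation_id}: loaded {params['source_file']} in {params['env']}"

def run_instances(instances):
    """Run several instances of the common job in parallel; results keep input order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(common_load_job, inv, **p) for inv, p in instances]
        return [f.result() for f in futures]
```

For example, `run_instances([("inst_a", {"source_file": "a.dat"}), ("inst_b", {"source_file": "b.dat", "env": "prod"})])` runs the same logic twice with different contexts, which is the behavior the common-job guideline asks for.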


    Modularity and Maintainability

    Modularity and maintainability are another influencing factor in job design.

    Reusable components and restartability bring the required modularity and maintainability.

    A proper balance needs to be struck between modularity and the I/O operations in a job, keeping restartability in mind.

    Performance considerations and maintainability should be properly balanced. For example, reducing the number of Transformers in a job will enhance performance, but this should not come at the cost of maintainability.


    Performance Considerations

    Identifying the correct stage for the required functionality is key in job design.

    The sequencing of stages in a job design should be decided with performance in mind; for example, avoid repartitioning.

    Using temporary tables, work tables, or datasets may enhance performance by reducing the load on jobs, which will influence job design.

    Make sure all the environment variables that can influence performance are part of the template.

    Consider the volume of data when choosing the stage.

    Detailed points that can influence job performance are covered in performance tuning.


    Metadata Management

    Job design is influenced by metadata management considerations.

    Jobs should not be driven by reject links.

    To avoid reject links, Lookups should select a dummy column from the reference link, and that column should be checked in subsequent stages such as a Transformer.
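The dummy-column technique can be illustrated outside DataStage with a short Python sketch. The names `lookup_with_dummy`, `split_on_dummy`, `ref_found`, and `ref_name` are hypothetical; the point is only that unmatched rows keep flowing with an empty dummy column and a later step decides what to do with them, instead of a reject link driving the flow.

```python
def lookup_with_dummy(rows, reference, key):
    """Left-join rows to reference data; unmatched rows continue with ref_found = None."""
    out = []
    for row in rows:
        ref = reference.get(row[key])
        merged = dict(row)
        merged["ref_found"] = 1 if ref is not None else None  # the dummy column
        merged["ref_name"] = ref["name"] if ref is not None else None
        out.append(merged)
    return out

def split_on_dummy(rows):
    """The check a downstream Transformer-like step would make on the dummy column."""
    matched = [r for r in rows if r["ref_found"] is not None]
    unmatched = [r for r in rows if r["ref_found"] is None]
    return matched, unmatched
```

Because every input row leaves the lookup on the same output, the row counts on the main link stay easy to reconcile.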