
    Extract Transform Load Cycle


Challenges in the ETL process

ETL functions are challenging because of the nature of the source systems:

- Diverse and disparate
- Different operating systems/platforms
- May not preserve historical data
- Quality of data may not be guaranteed in the older operational source systems
- Structures keep changing with time
- Prevalence of data inconsistency in the source systems
- Data may be stored in cryptic form
- Data type, format, and naming conventions may differ


    Steps for ETL process

- Determine all the target data needed
- Identify all the data sources, both internal and external
- Prepare data mapping for target data elements from the sources (a small mapping sketch follows this list)
- Determine data transformation and cleansing rules
- Plan for aggregate tables
- Organize the data staging area and test tools
- Write procedures for all data loads
- ETL for dimension tables
- ETL for fact tables
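As an illustration of the data-mapping step, one simple way to record source-to-target mappings is as a table. Every table, column, and rule name below is a hypothetical example:

```python
# A minimal sketch of a data-mapping document, expressed as Python data.
# All table, column, and rule names here are hypothetical examples.
TARGET_DATA_MAP = [
    # (target element, source system, source field, transformation/cleansing rule)
    ("customer_dim.full_name",  "crm",    "CUST_NM",  "parse into first/middle/last, trim"),
    ("customer_dim.birth_date", "crm",    "DOB",      "convert DDMMYYYY text to ISO date"),
    ("sales_fact.amount_usd",   "orders", "AMT",      "divide by 100 (stored in cents)"),
    ("sales_fact.store_key",    "orders", "STORE_CD", "look up surrogate key in store_dim"),
]

for target, system, field, rule in TARGET_DATA_MAP:
    print(f"{target:26} <- {system}.{field:10} rule: {rule}")
```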


    The ETL Process

- Capture/Extract
- Scrub (data cleansing)
- Transform
- Load

ETL = Extract, Transform, and Load


    The ETL Process

Source Systems --[Extract]--> Staging Area --[Transform, Load]--> Presentation System
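A minimal sketch of this extract → staging → transform → load flow in Python, with in-memory lists standing in for the source and presentation systems (all data and field names are invented for illustration):

```python
# Minimal end-to-end sketch: extract from a "source system" into a staging
# area, transform in staging, then load into the "presentation system".
# The data and field names are made up for illustration.

source_system = [
    {"id": 1, "name": " alice ", "amount_cents": "1250"},
    {"id": 2, "name": "BOB",     "amount_cents": "990"},
]

def extract(source):
    # Capture a snapshot of the source data into the staging area.
    return [dict(row) for row in source]

def transform(staged):
    # Apply cleansing and business rules in the staging area.
    for row in staged:
        row["name"] = row["name"].strip().title()               # standardize names
        row["amount_usd"] = int(row.pop("amount_cents")) / 100  # derived value
    return staged

def load(rows, warehouse):
    # Physically move the transformed rows into the warehouse table.
    warehouse.extend(rows)

presentation_system = []
load(transform(extract(source_system)), presentation_system)
print(presentation_system)
```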


Data Extraction

- Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated metadata)
- Sometimes source data is copied to the target database using the replication capabilities of a standard RDBMS (not recommended because of dirty data in the source systems)
- Alternatively, specialized ETL software is used


Data Extraction Techniques

Immediate data extraction (real time):
- capture through transaction logs
- capture through database triggers
- capture through source applications

Deferred data extraction (capture happens later):
- capture based on date and timestamp
- capture by comparing files


    Capture through transaction logs

- Does not provide much flexibility for capturing specifications
- Does not affect the performance of source systems
- Does not require any revisions to the existing source applications
- Cannot be used on a file-oriented system


    Capture through database triggers

- Does not provide much flexibility for capturing specifications
- Does not affect the performance of source systems
- Does not require any revisions to the existing source applications
- Cannot be used on a file-oriented system
- Cannot be used on a legacy system
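Trigger-based capture is easy to picture with a simplified sketch using SQLite triggers (the table and column names are invented): every insert or update on the source table is recorded in a capture table that the ETL extract job reads later.

```python
import sqlite3

# Simplified illustration of trigger-based capture using SQLite.
# Every insert/update on the source table is recorded in a capture table
# that the ETL extract job reads later. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_changes (id INTEGER, name TEXT, op TEXT,
                               changed_at TEXT DEFAULT CURRENT_TIMESTAMP);

CREATE TRIGGER customer_ins AFTER INSERT ON customer BEGIN
    INSERT INTO customer_changes (id, name, op) VALUES (NEW.id, NEW.name, 'I');
END;
CREATE TRIGGER customer_upd AFTER UPDATE ON customer BEGIN
    INSERT INTO customer_changes (id, name, op) VALUES (NEW.id, NEW.name, 'U');
END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("UPDATE customer SET name = 'Alicia' WHERE id = 1")

# The ETL job extracts the captured changes instead of scanning the table.
print(conn.execute("SELECT id, name, op FROM customer_changes").fetchall())
```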


    Capture in Source Application

- Provides flexibility for capturing specifications
- Does not affect the performance of source systems
- Requires the existing source systems to be revised
- Can be used on a file-oriented system
- Can be used on a legacy system
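A toy illustration of this technique: the source application is revised so that each write also appends a change record to an extract log. The file name and record layout are invented for the example.

```python
import json

# Sketch: the source application is revised so that every write also
# appends a change record to an extract log the ETL job consumes later.
# File name and record layout are invented for the example.
EXTRACT_LOG = "customer_changes.jsonl"

def save_customer(db, customer):
    db[customer["id"]] = customer          # the application's normal write
    with open(EXTRACT_LOG, "a") as log:    # the added capture step
        log.write(json.dumps(customer) + "\n")

db = {}
save_customer(db, {"id": 1, "name": "Alice"})
save_customer(db, {"id": 1, "name": "Alicia"})
```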


Capture based on date and timestamp

- Provides flexibility for capturing specifications
- Does not affect the performance of source systems
- Requires the existing source systems to be revised
- Can be used on a file-oriented system
- Cannot be used on a legacy system
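A minimal sketch of date/timestamp-based capture, assuming each source row carries a last-updated timestamp column (names and data are illustrative):

```python
from datetime import datetime

# Deferred extraction: select only rows updated since the previous run.
# Assumes the source rows carry an "updated_at" timestamp (illustrative).
source_rows = [
    {"id": 1, "name": "Alice", "updated_at": datetime(2019, 8, 6, 10, 0)},
    {"id": 2, "name": "Bob",   "updated_at": datetime(2019, 8, 7, 9, 30)},
]

last_extract_time = datetime(2019, 8, 7, 0, 0)   # stored by the previous run

changed = [r for r in source_rows if r["updated_at"] > last_extract_time]
print(changed)   # only Bob's row is picked up this cycle
```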


    Capture by comparing files

- Provides flexibility for capturing specifications
- Does not affect the performance of source systems
- Does not require the existing source systems to be revised
- May be used on a file-oriented system
- May be used on a legacy system
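A sketch of capture by file comparison: keep the previous snapshot, compare it with today's, and emit the differences. The snapshots are hard-coded here for illustration.

```python
# Deferred extraction by comparing today's file snapshot with yesterday's.
# Rows are keyed by id; the snapshots are hard-coded for illustration.
yesterday = {1: "Alice,NY", 2: "Bob,LA"}
today     = {1: "Alice,SF", 2: "Bob,LA", 3: "Carol,TX"}

inserts = {k: v for k, v in today.items() if k not in yesterday}
updates = {k: v for k, v in today.items() if k in yesterday and yesterday[k] != v}
deletes = {k: v for k, v in yesterday.items() if k not in today}

print("inserts:", inserts)   # {3: 'Carol,TX'}
print("updates:", updates)   # {1: 'Alice,SF'}
print("deletes:", deletes)   # {} (no rows disappeared)
```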


    Data Cleansing

- Source systems contain dirty data that must be cleansed
- ETL software contains rudimentary data cleansing capabilities
- Specialized data cleansing software is often used; it is important for performing name and address correction and householding functions
- Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)


    Reasons for Dirty Data

- Dummy values
- Absence of data
- Multipurpose fields
- Cryptic data
- Contradicting data
- Inappropriate use of address lines
- Violation of business rules
- Reused primary keys
- Non-unique identifiers
- Data integration problems


    Parsing

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.

Examples include parsing the first, middle, and last name; street number and street name; and city and state.
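A deliberately naive parsing sketch for a free-form name field:

```python
# Naive parsing sketch: isolate first, middle, and last name from one field.
# Real cleansing tools use much richer rules than this simple split.
def parse_name(full_name):
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    middle = " ".join(parts[1:-1])   # empty if there is no middle name
    return {"first": first, "middle": middle, "last": last}

print(parse_name("John Q Public"))
# {'first': 'John', 'middle': 'Q', 'last': 'Public'}
```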


    Correcting

Correcting repairs individual parsed data components using sophisticated data algorithms and secondary data sources.

Examples include replacing a vanity address and adding a ZIP code.
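A toy illustration of correcting with a secondary data source: filling in a missing ZIP code from a small city-to-ZIP lookup table (the lookup values are hypothetical).

```python
# Toy correction step: use a secondary data source (a city -> ZIP lookup,
# hypothetical values) to fill in a missing zip field.
CITY_ZIP = {"Springfield": "62701", "Shelbyville": "62565"}

def correct_record(rec):
    if not rec.get("zip"):
        rec["zip"] = CITY_ZIP.get(rec["city"], "")  # leave blank if unknown
    return rec

print(correct_record({"city": "Springfield", "zip": ""}))
# {'city': 'Springfield', 'zip': '62701'}
```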


    Standardizing

Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.

Examples include adding a pre-name, replacing a nickname, and using a preferred street name.
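A small standardizing sketch that replaces nicknames and expands street abbreviations using custom rule tables; the rule values are examples only, not a standard list.

```python
# Standardizing sketch: conversion routines driven by custom rule tables.
# The rule values below are examples only.
NICKNAMES = {"Bill": "William", "Bob": "Robert", "Liz": "Elizabeth"}
STREET_WORDS = {"St": "Street", "Ave": "Avenue", "Rd": "Road"}

def standardize(rec):
    rec["first"] = NICKNAMES.get(rec["first"], rec["first"])
    rec["street"] = " ".join(STREET_WORDS.get(w, w) for w in rec["street"].split())
    return rec

print(standardize({"first": "Bob", "street": "12 Main St"}))
# {'first': 'Robert', 'street': '12 Main Street'}
```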


    Matching

Matching searches for and matches records within and across the parsed, corrected, and standardized data, based on predefined business rules, to eliminate duplications.

Examples include identifying similar names and addresses.
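A matching sketch using simple string similarity from Python's standard difflib module; the 0.85 threshold is an arbitrary example of a predefined business rule.

```python
from difflib import SequenceMatcher

# Matching sketch: flag record pairs whose standardized names/addresses are
# highly similar. The 0.85 threshold is an arbitrary example business rule.
records = ["Robert Smith, 12 Main Street",
           "Robert Smyth, 12 Main Street",
           "Elizabeth Jones, 4 Oak Avenue"]

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = SequenceMatcher(None, records[i], records[j]).ratio()
        if score > 0.85:
            print(f"possible duplicate ({score:.2f}): {records[i]} / {records[j]}")
```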


    Consolidating

Consolidating analyzes and identifies relationships between matched records and consolidates/merges them into ONE representation.
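A toy consolidation step: merge a group of matched records into one surviving representation. The survivorship rule here (first non-empty value wins) is just an example.

```python
# Consolidation sketch: merge matched records into ONE representation.
# The survivorship rule (first non-empty value wins) is only an example.
matched = [
    {"name": "Robert Smith", "phone": "",         "email": "rob@example.com"},
    {"name": "Robert Smyth", "phone": "555-0100", "email": ""},
]

consolidated = {}
for rec in matched:
    for field, value in rec.items():
        if value and not consolidated.get(field):
            consolidated[field] = value

print(consolidated)
# {'name': 'Robert Smith', 'email': 'rob@example.com', 'phone': '555-0100'}
```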


    Data Transformation

- Transforms the data in accordance with the business rules and standards that have been established
- Examples include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
- Deals with rectifying any inconsistency
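A short sketch covering several of the example tasks above: a format change, replacement of codes, a derived value, and an aggregate (the codes and fields are invented):

```python
from collections import defaultdict
from datetime import datetime

# Transformation sketch: format change, code replacement, derived value,
# and aggregate. Codes and fields are invented for illustration.
STATUS_CODES = {"A": "Active", "I": "Inactive"}   # replacement of codes

rows = [
    {"date": "07/08/2019", "status": "A", "qty": 2, "unit_price": 9.5},
    {"date": "07/08/2019", "status": "I", "qty": 1, "unit_price": 4.0},
]

total_by_date = defaultdict(float)
for r in rows:
    r["date"] = datetime.strptime(r["date"], "%d/%m/%Y").date().isoformat()  # format change
    r["status"] = STATUS_CODES[r["status"]]                                  # code replacement
    r["amount"] = r["qty"] * r["unit_price"]                                 # derived value
    total_by_date[r["date"]] += r["amount"]                                  # aggregate

print(rows)
print(dict(total_by_date))   # {'2019-08-07': 23.0}
```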


- Attribute naming inconsistency is an issue: once all the data elements have the right names, they must be converted into common formats
- Data formats have to be standardized
- All the transformation activities are automated
- Tool: DataMapper


    Basic tasks in Transformation

- Selection
- Splitting/joining
- Conversion
- Summarization


    Data Loading

- Data are physically moved to the data warehouse
- The loading takes place within a time window
- The trend is toward near real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications


Different modes in which data can be applied to the warehouse:

- Load
- Append
- Merge
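A sketch of the merge mode as an upsert, using SQLite's ON CONFLICT clause (available in SQLite 3.24+); load and append would be plain INSERTs into an empty or existing table. The table and data are illustrative.

```python
import sqlite3

# Merge (upsert) mode sketched with SQLite's ON CONFLICT clause
# (requires SQLite 3.24+). Load/append modes would be plain INSERTs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Alice')")

incoming = [(1, "Alicia"), (2, "Bob")]   # one update, one new row
conn.executemany(
    """INSERT INTO customer_dim (id, name) VALUES (?, ?)
       ON CONFLICT(id) DO UPDATE SET name = excluded.name""",
    incoming,
)
print(conn.execute("SELECT * FROM customer_dim ORDER BY id").fetchall())
# [(1, 'Alicia'), (2, 'Bob')]
```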


    Loading Techniques

- Initial load
- Incremental load
- Full refresh


    Sample ETL Tools

- Teradata Warehouse Builder from Teradata
- DataStage from Ascential Software
- SAS System from SAS Institute
- PowerMart/PowerCenter from Informatica
- Sagent Solution from Sagent Software
- Hummingbird Genio Suite from Hummingbird Communications


    Steps in data reconciliation

- Capture (extract) = obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
  - Static extract = capturing a snapshot of the source data at a point in time
  - Incremental extract = capturing changes that have occurred since the last static extract


    Steps in data reconciliation (continued)

- Scrub (cleanse) = uses pattern recognition and AI techniques to upgrade data quality
  - Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
  - Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data


    Steps in data reconciliation (continued)

- Transform = convert data from the format of the operational system to the format of the data warehouse
  - Record-level: selection (data partitioning), joining (data combining), aggregation (data summarization)
  - Field-level: single-field (from one field to one field), multi-field (from many fields to one field, or one field to many)
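A short sketch of the field-level cases with invented fields: one single-field conversion and both multi-field directions (many-to-one and one-to-many).

```python
# Field-level transformation sketch (illustrative fields).
rec = {"gender_code": "F", "first": "Ada", "last": "Lovelace",
       "full_address": "12 Main Street, Springfield"}

# Single-field: one source field to one target field.
rec["gender"] = {"F": "Female", "M": "Male"}[rec.pop("gender_code")]

# Multi-field, many-to-one: combine first and last name.
rec["full_name"] = f"{rec.pop('first')} {rec.pop('last')}"

# Multi-field, one-to-many: split the address into street and city.
rec["street"], rec["city"] = [p.strip() for p in rec.pop("full_address").split(",")]

print(rec)
```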


    Steps in data reconciliation (continued)

- Load/Index = place transformed data into the warehouse and create indexes
  - Refresh mode: bulk rewriting of target data at periodic intervals
  - Update mode: only changes in source data are written to the data warehouse