ETL Design Methodology Document
Microsoft SQL Server Integration Services

Durable Impact Consulting, Inc.
www.durableimpact.com

Version 1.0
Wes Dumey

Copyright 2006. Protected by the ‘Open Document License’




ETL Methodology Document

Document Licensing Standards

This document is protected by the copyright laws of the United States of America. In order to facilitate development of an open source ETL Methodology document, users are permitted to modify this document at will, with the expressed understanding that any changes that improve upon or add to this methodology become property of the open community and must be forwarded back to the original author for inclusion in future releases of this document. This document or any portion thereof may not be sold or bartered for any form of compensation without expressed written consent of the original author.

By using this document you are agreeing to the terms listed above.

Page 2 of 12 4/7/23


Overview

This document is designed for use by business associates and technical resources to better understand the process of building a data warehouse and the methodology employed to build the enterprise data warehouse (EDW).

This methodology has been designed to provide the following benefits:

1. A high level of performance
2. Scalability to any size
3. Ease of maintenance
4. Boiler-plate development
5. Standard documentation techniques

ETL Definitions

Term – Definition

ETL (Extract, Transform, Load) – The physical process of extracting data from a source system, transforming the data to the desired state, and loading it into a database

EDW (Enterprise Data Warehouse) – The logical data warehouse designed for enterprise information storage and reporting

DM (Data Mart) – A small subset of a data warehouse specifically defined for a subject area

Documentation Specifications

A primary driver of the entire process is accurate business information requirements. Durable Impact Consulting will use standard documents prepared by the Project Management Institute for requirements gathering, project signoff, and compiling all testing information.

ETL Naming Conventions

To maintain consistency, all ETL processes will follow a standard naming convention.


Tables

All destination tables will utilize the following naming convention: EDW_<SUBJECT>_<TYPE>

There are six types of tables used in a data warehouse: Fact, Dimension, Aggregate, Staging, Temp, and Audit. A quick overview of each type is given below, followed by sample names.

Fact – a table type that contains atomic data
Dimension – a table type that contains referential data needed by the fact tables
Aggregate – a table type used to aggregate data, forming a pre-computed answer to a business question (ex. totals by day)
Staging – tables used to store data during ETL processing; the data is not removed immediately
Temp – tables used during ETL processing that can be truncated immediately afterwards (ex. storing order IDs for lookup)
Audit – tables used to keep track of the ETL process (ex. processing times by job)

Each type of table will be kept in a separate schema. This will decrease maintenance work and time spent looking for a specific table.

Table Name – Explanation

EDW_RX_FACT – Fact table containing RX subject matter
EDW_TIME_DIM – Dimension table containing TIME subject matter
EDW_CUSTOMER_AG – Aggregate table containing CUSTOMER subject matter
ETL_PROCESS_AUDIT – Audit table containing PROCESS data
STG_DI_CUSTOMER – Staging table sourced from the DI system, used for CUSTOMER data processing
ETL_ADDRESS_TEMP – Temp table used for ADDRESS processing
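As a minimal sketch (not itself part of the methodology), the EDW_<SUBJECT>_<TYPE> convention can be enforced programmatically. The suffix map below is inferred from the sample names above; staging, temp, and audit tables use other prefixes and are not covered by this helper.

```python
# Hypothetical helper enforcing the EDW_<SUBJECT>_<TYPE> convention for the
# three EDW destination table types. The suffix spellings (FACT, DIM, AG)
# are inferred from the sample names, not defined by the methodology.
TYPE_SUFFIXES = {
    "fact": "FACT",
    "dimension": "DIM",
    "aggregate": "AG",
}

def edw_table_name(subject: str, table_type: str) -> str:
    """Build a destination table name such as EDW_RX_FACT."""
    return f"EDW_{subject.upper()}_{TYPE_SUFFIXES[table_type.lower()]}"

print(edw_table_name("rx", "fact"))         # EDW_RX_FACT
print(edw_table_name("time", "dimension"))  # EDW_TIME_DIM
```

A helper like this keeps ad-hoc name drift out of the warehouse and gives boiler-plate jobs one place to look names up.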

ETL Processing

The following types of ETL jobs will be used for processing. The table below lists each job type, its function, and its naming convention.

Job Type – Explanation – Naming Convention

Extract – Extracts information from a source system and places it in a staging table – Extract<Source><Subject> (ex. ExtractDICustomer)
Load PSA – Loads the persistent staging area – LoadPSA<Table>
Source and Load Temp – Sources information from STG tables, performs column validation, and loads the temp tables used in processing – Source<Table> (ex. SourceSTGDICustomer)
Lookup Unload Dimensions – Looks up and unloads dimension tables into flat files – LookupUnloadDimensions
Lookup Unload Facts – Looks up and unloads fact tables into flat files – LookupUnloadFacts
Transform Facts – Transforms the fact subject area data and generates insert files – TransformFacts
Transform Dimensions – Transforms the dimension subject area data and generates insert files – TransformDimensions
Quality Check – Checks the quality of the data before it is loaded into the EDW – QualityCheck<Subject> (ex. QualityCheckCustomer)
Aggregate – Aggregates data – Aggregate
Update Records – Loads/inserts the changed records into the EDW – UpdateRecords

ETL Job Standards

All ETL jobs will be created with a boiler-plate approach. This approach allows for rapid creation of similar jobs while keeping maintenance low.

Comments

Every job will have a standard comment template that specifically spells out the following attributes of the job:

Job Name: LoadPSA
Purpose: Load the ETL_PSA_CUSTOMERS table
Predecessor: Extract Customers
Date: July 10, 2007
Author: Wes Dumey
Revision History:
  April 21, 2007 – Created the job from the standard template
  May 22, 2007 – Added new columns to the PSA tables

In addition, a job data dictionary will describe every job in a table so that it can be searched easily via standard SQL.
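A searchable job data dictionary might look like the sketch below. The table name ETL_JOB_DICTIONARY and its columns are assumptions for illustration, and SQLite stands in for SQL Server here.

```python
import sqlite3

# Sketch of a searchable job data dictionary. The table name and columns
# (ETL_JOB_DICTIONARY, JOB_NAME, PURPOSE, PREDECESSOR) are assumptions,
# not definitions from the methodology.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ETL_JOB_DICTIONARY (
        JOB_NAME    TEXT PRIMARY KEY,
        PURPOSE     TEXT,
        PREDECESSOR TEXT
    )""")
con.executemany(
    "INSERT INTO ETL_JOB_DICTIONARY VALUES (?, ?, ?)",
    [("ExtractDICustomer", "Extract CUSTOMER data from the DI source system", None),
     ("LoadPSACustomer", "Load the customer PSA table", "ExtractDICustomer")])

# Standard SQL search: find every job that touches customer data.
rows = con.execute(
    "SELECT JOB_NAME FROM ETL_JOB_DICTIONARY "
    "WHERE PURPOSE LIKE '%customer%' ORDER BY JOB_NAME").fetchall()
print([r[0] for r in rows])
```

Because the dictionary is just a table, any resource with read access can answer "what job loads this table?" without opening the SSIS packages themselves.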


Persistent Staging Areas

Data will be received from the source systems in its native format. The data will be stored in a PSA table following the naming standards listed previously. The table will contain the following layout:

Column – Data Type – Explanation

ROW_NUMBER – NUMBER – Unique for each row in the PSA
DATE – DATE – Date the row was placed in the PSA
STATUS_CODE – CHAR(1) – Indicates the status of the row (‘I’ inducted, ‘P’ processed, ‘R’ rejected)
ISSUE_CODE – NUMBER – Code uniquely identifying problems with the data when STATUS_CODE = ‘R’
BATCH_NUMBER – NUMBER – Batch number used to process the data (auditing)

Data columns to follow.
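An illustrative DDL sketch of this PSA layout is shown below, with SQLite types standing in for SQL Server types. The DATE column is named LOAD_DATE here only to avoid a reserved word, and the source-specific data columns remain elided as in the layout above.

```python
import sqlite3

# Illustrative DDL for a PSA table following the layout above. SQLite stands
# in for SQL Server, and LOAD_DATE is an assumed name for the DATE column.
ddl = """
CREATE TABLE ETL_PSA_CUSTOMERS (
    ROW_NUMBER   INTEGER PRIMARY KEY,  -- unique for each row in the PSA
    LOAD_DATE    DATE,                 -- date the row was placed in the PSA
    STATUS_CODE  CHAR(1),              -- 'I' inducted, 'P' processed, 'R' rejected
    ISSUE_CODE   INTEGER,              -- set when STATUS_CODE = 'R'
    BATCH_NUMBER INTEGER               -- batch number used to process the data
    -- data columns to follow
)
"""
con = sqlite3.connect(":memory:")
con.execute(ddl)
cols = [row[1] for row in con.execute("PRAGMA table_info(ETL_PSA_CUSTOMERS)")]
print(cols)
```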

Auditing

The ETL methodology maintains a process for providing audit and logging capabilities.

For each run of the process, a unique batch number composed of the time segments is created. This batch number is loaded with the data into the PSA and all target tables. In addition, an entry with the following data elements will be made into the ETL_PROCESS_AUDIT table.

Column – Data Type – Explanation

DATE – DATE – (Indexed) run date
BATCH_NUMBER – NUMBER – Batch number of the process
PROCESS_NAME – VARCHAR – Name of the process that was executed
PROCESS_RUN_TIME – TIMESTAMP – Time (HH:MI:SS) of process execution
PROCESS_STATUS – CHAR – ‘S’ SUCCESS, ‘F’ FAILURE
ISSUE_CODE – NUMBER – Code of the issue related to the process failure (if ‘F’)
RECORD_PROCESS_COUNT – NUMBER – Row count of records processed during the run
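One plausible reading of "a unique batch number composed of the time segments" is a YYYYMMDDHHMMSS integer derived from the run's start time; the exact segment layout is an assumption, as the methodology does not spell it out.

```python
from datetime import datetime

def batch_number(run_start: datetime) -> int:
    """Compose a batch number from the run's time segments.

    The YYYYMMDDHHMMSS layout is an assumption; the methodology states
    only that the batch number is built from time segments.
    """
    return int(run_start.strftime("%Y%m%d%H%M%S"))

print(batch_number(datetime(2007, 7, 10, 4, 30, 0)))  # 20070710043000
```

Stamping this one value on the PSA, the target tables, and the ETL_PROCESS_AUDIT entry lets every row be traced back to the run that produced it.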


The audit process will allow for efficient logging of process execution and encountered errors.

Quality

Due to the sensitive nature of data within the EDW, data quality is a driving priority. Quality will be handled through the following processes:

1. Source job – the source job will contain a quick data-scrubbing mechanism that verifies the data conforms to the expected type (numeric fields contain numbers, character fields contain letters).

2. Transform – the transform job will carry metadata matching the target table and verify that NULL values are not loaded into NOT NULL columns and that the data is transformed correctly.

3. QualityCheck – a separate job is created to do a cursory check on a few identified columns and verify that the correct data is loaded into these columns.

Source Quality

A data scrubbing mechanism will be constructed. This mechanism will check identified columns for any anomalies (ex. Embedded carriage returns) and value domains. If an error is discovered, the data is fixed and a record is written in the ETL_QUALITY_ISSUES table (see below for table definition).
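A minimal sketch of this scrubbing idea follows. The STATE column, its value domain, and the issue-record tuples are invented for illustration; in practice the rules and issue codes would come from reference tables such as ETL_QUALITY_ISSUES.

```python
# Minimal sketch of the source-job scrubbing mechanism. The STATE column,
# the value domain, and the issue tuples are placeholders for illustration.
STATE_DOMAIN = {"FL", "GA", "TX"}

def scrub_state(value: str, issues: list) -> str:
    """Fix anomalies in a STATE value, recording any findings."""
    cleaned = value.replace("\r", "").replace("\n", "").strip()
    if cleaned != value:
        # anomaly such as an embedded carriage return: fix it and log it
        issues.append(("STATE", value, "embedded control characters", "L"))
    if cleaned not in STATE_DOMAIN:
        # value outside the allowed domain: log as high severity
        issues.append(("STATE", cleaned, "outside value domain", "H"))
    return cleaned

found = []
print(scrub_state("FL\r", found))  # FL
print(len(found))                  # 1
```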

Transform Quality

The transformation job will employ a matching metadata technique. If the target table enforces NOT NULL constraints, a check will be built into the job preventing NULLS from being loaded and causing a jobstream abend.
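The matching-metadata guard might be sketched as below. The column names are placeholders; in a real job the NOT NULL set would be read from the target table's definition rather than hard-coded.

```python
# Sketch of the matching-metadata NOT NULL check performed before insert
# files are generated. The column names are placeholders, not a real
# target-table definition.
NOT_NULL_COLUMNS = {"CUSTOMER_ID", "CUSTOMER_NAME"}

def violates_not_null(row: dict) -> list:
    """Return the NOT NULL columns this row would violate."""
    return sorted(col for col in NOT_NULL_COLUMNS if row.get(col) is None)

print(violates_not_null({"CUSTOMER_ID": 1, "CUSTOMER_NAME": None}))
```

Rows that fail the check can be rejected before the load step, so a stray NULL never gets the chance to abend the jobstream.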

Quality Check

Quality check is the last point of validation within the jobstream. QC can be configured to check any percentage of rows (0-100%) and any number of columns (1-X). QC is designed to pay attention to the most valuable or vulnerable rows within the data sets. QC will use a modified version of the data-scrubbing engine used during the source job to derive correct values and will reference the rules listed in the ETL_QC_DRIVER table. Any suspect rows will be pulled from the insert/update files, updated to an ‘R’ status in the PSA table, and assigned an issue code for the failure.
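The configurable row percentage might be sketched as below. Simple random sampling with a fixed seed is an assumption made for repeatability; the methodology itself targets the most valuable or vulnerable rows rather than a purely random pick.

```python
import random

# Sketch of configurable QC sampling (0-100% of rows). Random sampling and
# the fixed seed are assumptions made so the example is repeatable.
def sample_for_qc(rows: list, percentage: float, seed: int = 0) -> list:
    """Select roughly `percentage` percent of rows for quality checking."""
    rng = random.Random(seed)
    k = round(len(rows) * percentage / 100)
    return rng.sample(rows, k)

checked = sample_for_qc(list(range(1000)), 10)
print(len(checked))  # 100
```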

Logging of Data Failures


Data that fails the QC job will not be loaded into the EDW, based on the defined rules. An entry will be made into the ETL_QUALITY_ISSUES table (defined below). A severity indicator (‘H’ HIGH, ‘L’ LOW), as defined in the rules, will be recorded with each entry. This indicator allows resources to be directed efficiently when tracing errors.

ETL_QUALITY_ISSUES

Column – Data Type – Explanation

DATE – DATE – Date of entry
BATCH_NUMBER – NUMBER – Batch number of the process creating the entry
PROCESS_NAME – VARCHAR – Name of the process creating the entry
COLUMN_NAME – VARCHAR – Name of the column failing validation
COLUMN_VALUE – VARCHAR – Value of the column failing validation
EXPECTED_VALUE – VARCHAR – Expected value of the column failing validation
ISSUE_CODE – NUMBER – Issue code assigned to the error
SEVERITY – CHAR – ‘H’ HIGH, ‘L’ LOW

ETL_QUALITY_AUDIT

Column – Data Type – Explanation

DATE – DATE – Date of entry
BATCH_NUMBER – NUMBER – Batch number of the process creating the entry
PROCESS_NAME – VARCHAR – Name of the process creating the entry
RECORD_PROCESS_COUNT – NUMBER – Number of records processed
RECORD_COUNT_CHECKED – NUMBER – Number of records checked
PERCENTAGE_CHECKED – NUMBER – Percentage of the data set that was checked

ETL Job Templates

Extract


Source Combined with Load Temp


Lookup Dimension


Transform


Load

Closing

After reading this document you should have a better understanding of the issues associated with ETL processing. This methodology has been created to address as many of those issues as possible while providing a high level of performance, ease of maintenance, and scalability, and while remaining workable in a real-time ETL processing scenario.
