data warhouse introduction

8/10/2019 data warhouse introduction
http://slidepdf.com/reader/full/data-warhouse-introduction (93 pages)
Copyright © 2013 Tech Mahindra. All rights reserved.
Sreenivas_Ram, +91 99 499 10792
DWH Concepts

Upload: mahesh-prince

Post on 02-Jun-2018


Page 1: data warhouse introduction


Sreenivas_Ram, +91 99 499 10792

DWH Concepts

Page 2: data warhouse introduction


Module Breakup

The course is covered in the following modules:

Module 1 : DW – Overview & Data Warehouse vs OLTP
Module 2 : Architecture of Data Warehouse
Module 3 : ETL Process
Module 4 : Data Warehouse vs Data Mart & Conceptual DW Models

Page 3: data warhouse introduction


Module – 1

Data Warehouse Concepts

Page 4: data warhouse introduction


Page 5: data warhouse introduction


What is BI?

Business intelligence (BI) is a broad category of application programs and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions.

BI applications include the activities of:

• decision support,
• query and reporting,
• online analytical processing (OLAP),
• statistical analysis,
• forecasting, and
• data mining.

Example: Business Objects (www.businessobjects.com)

Page 6: data warhouse introduction


BI – Nutshell

[Figure: BI in a nutshell. Label: Raw Data]

Page 7: data warhouse introduction


A producer wants to know…

• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
• What is the most effective distribution channel?

Page 8: data warhouse introduction


Data, data everywhere, yet …

• I can’t find the data I need
 – data is scattered over the network
 – many versions, subtle differences

• I can’t get the data I need
 – need an expert to get the data

• I can’t understand the data I found
 – available data poorly documented

• I can’t use the data I found
 – results are unexpected
 – data needs to be transformed from one form to another

Page 9: data warhouse introduction


What is a Data Warehouse?

“A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.”

[Barry Devlin]

Page 10: data warhouse introduction


What is Data Warehousing?

The aim of data warehousing is to make more effective use of the data available in an organization and to aid decision-making processes.

A data warehouse is a collection of data gathered and organized so that it can easily be analyzed, extracted, synthesized, and otherwise used to further understand the data. It may be contrasted with data that is gathered to meet immediate business objectives, such as order and payment transactions, although this data would also usually become part of a data warehouse.

A data warehouse is a central integrated database containing data from all the operational sources and archive systems in an organization. It contains a copy of transaction data specifically structured for query and analysis.

Page 11: data warhouse introduction


What are the users saying...

• Data should be integrated across the enterprise
• Summary data has a real value to the organization
• Historical data holds the key to understanding data over time
• What-if capabilities are required

Page 12: data warhouse introduction


What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference.

[Figure: Data, Information]

Page 13: data warhouse introduction


Evolution

60’s: Batch reports
 – hard to find and analyze information
 – inflexible and expensive, reprogram every new request

70’s: Terminal-based DSS and EIS (executive information systems)
 – still inflexible, not integrated with desktop tools

80’s: Desktop data access and analysis tools
 – query tools, spreadsheets, GUIs
 – easier to use, but only access operational databases

90’s till now: Data warehousing with integrated OLAP engines and tools, real-time DW

Page 14: data warhouse introduction


Data Warehouse

A data warehouse is a

• subject-oriented
• integrated
• time-varying
• non-volatile
• accessible

collection of data that is used primarily in organizational decision making.

-- Bill Inmon, Building the Data Warehouse, 1996

Page 15: data warhouse introduction


Data Warehouse Architecture

[Figure labels: Relational Databases, Legacy Data, Purchased Data, ERP Systems; Extraction, Cleansing; Optimized Loader; Data Warehouse Engine; Analyze, Query; Metadata Repository]

Page 16: data warhouse introduction


Data Mining works with Warehouse Data

• Data Warehousing provides the Enterprise with a Memory

• Data Mining provides the Enterprise with Intelligence

Page 17: data warhouse introduction


Page 18: data warhouse introduction


Why Separate Data Warehouse?

Performance

• Operational databases are designed and tuned for known transactions and workloads.
• Complex OLAP queries would degrade performance for operational transactions.
• Special data organization, access and implementation methods are needed for multidimensional views and queries.

Function

• Missing data: Decision support requires historical data, which operational databases do not typically maintain.
• Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
• Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

Page 19: data warhouse introduction


Page 20: data warhouse introduction


So, what’s different? 

Page 21: data warhouse introduction


Application-Orientation vs Subject-Orientation

Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings

Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity

Page 22: data warhouse introduction


OLTP vs Data Warehouse

OLTP                     WAREHOUSE (DSS)
Application oriented     Subject oriented
Used to run business     Used to analyze business
Detailed data            Summarized and refined
Current, up-to-date      Snapshot data
Isolated data            Integrated data
Repetitive access        Ad-hoc access
Clerical user            Knowledge user (manager)

Page 23: data warhouse introduction


OLTP Vs Data Warehouse

OLTP                                     DATA WAREHOUSE
Performance sensitive                    Performance relaxed
Few records accessed at a time (tens)    Large volumes accessed at a time (millions)
Read / update access                     Mostly read (batch update)
No data redundancy                       Redundancy present
DB size: 100 MB – 100 GB                 DB size: 100 GB – few TBs
Thousands of users                       Hundreds of users
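The two access patterns compared above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the orders table, its columns and its rows are hypothetical, and a real warehouse would hold far more data in a separate, purpose-built database.

```python
import sqlite3

# Hypothetical order-entry table, used here only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
    [("Asha", "South", 120.0), ("Ravi", "North", 80.0),
     ("Asha", "South", 40.0), ("Meena", "North", 60.0)],
)

# OLTP-style access: read and update a single record by its key.
conn.execute("UPDATE orders SET amount = 130.0 WHERE order_id = 1")
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = 1"
).fetchone()

# Warehouse-style access: a read-only query that summarizes many records.
sales_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(one_order)
print(sales_by_region)
```

The first query touches one row and changes it; the second reads every row and returns a summary, which is the kind of workload the slides assign to the warehouse.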

Page 24: data warhouse introduction


To summarize ...

OLTP systems are used to “run” a business.

The data warehouse helps to “optimize” the business.

Page 25: data warhouse introduction


Quick Recap

• What is BI? BI incorporates the ability for mining the data, analyzing and reporting.
• What is Data Warehouse? A single, complete and consistent store of data obtained from various sources.
• Architecture of Data Warehouse
• How Data Mining works with Data Warehouse: data mining provides the enterprise warehouse with intelligence.
• Benefits of Data Warehouse: reliable reporting, rapid access to data, integrated data, better decision making.
• Differences between Data Warehouse and OLTP: the data warehouse is used to analyze the business; OLTP is used to run the business.
• Need for a Separate Warehouse

Page 26: data warhouse introduction


QUIZ

1. _____ is a subject-oriented view of a data warehouse.
   OLTP system / Data Staging Area / Data Mart / None

2. Data Mining implies _____
   Modeling / Forecasting / Explanatory Analysis

3. An order entry system is an example of an OLTP system.
   True / False

4. The number of concurrent users of a data warehouse is not large.
   True / False

5. Data Extraction is the process of _____
   A. Taking information/data from the source and making it available to the DWH
   B. Taking extracted data and loading it into the DWH
   C. Both

Answers: 1. Data Mart  2. Forecasting  3. True  4. True  5. Both

Page 27: data warhouse introduction


Module – 2

Data Warehouse Architecture

Page 28: data warhouse introduction


Architecture, Design & Construction

DW Architecture

Loading, Refreshing

Structuring / Modeling

Page 29: data warhouse introduction


Topics to be covered

This module covers:

• Data Warehousing Architecture
• Generic Two-Level Architecture
 – Independent Data Mart
 – Dependent Data Mart with Data Store
• ETL Process
• Data Quality Assurance
• Data Quality Tools
• ETVL Tools
• Meta Data & Importance

Page 30: data warhouse introduction


Data Warehouse Architecture

[Figure labels: Operational Systems; External Systems; Information Transformation/Migration Infrastructure; Enterprise Data Warehouse; Finance Datamart (Independent); Sales Datamart (Dependent); Mkting Datamart (Dependent); Replication Services; Web Server; Light Clients; LAN Clients]

Page 31: data warhouse introduction


Data Warehouse Architecture

[Figure labels: Data Stores; Legacy System; Metadata Repository; Staging Area; Extraction/Transformation Server; To Warehouse/Data Mart]

• Metadata Design/Mgmt
• Scrubbing Tool
• Mapping Tool
• Extraction Mgmt Tool
• Transformation Tool
• Migration Mgmt Tool

Page 32: data warhouse introduction


Data Warehouse Architecture

Generic Two-Level Architecture

Independent Data Mart

Dependent Data Mart and Operational Data Store

All involve some form of extraction, transformation and loading (ETL)

Page 33: data warhouse introduction


Generic Two-Level Architecture

[Figure: E, T, L steps]

• One, company-wide warehouse
• Periodic extraction, so data is not completely current in the warehouse

Page 34: data warhouse introduction


Independent Data Mart

[Figure: Independent Data Mart Architecture, with E, T, L steps]

• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts

Page 35: data warhouse introduction


Dependent Data Mart with Operational Data Store

[Figure: E, T, L steps]

• Single ETL for the Enterprise Data Warehouse (EDW)
• Simpler data access
• ODS provides an option for obtaining current data
• Dependent data marts are loaded from the EDW

Page 36: data warhouse introduction


Quick Recap

• Generic Two-Level Architecture
• Independent Data Mart: the data in this data mart is stored independently from other data marts.
• Dependent Data Mart: data and dimensions (tables) are shared across multiple data marts.

Page 37: data warhouse introduction


QUIZ

1. Which of the following statements is FALSE w.r.t. the top-down approach?
   a. The data warehouse holds the atomic data extracted from source systems, and from there the data is distributed to one or more data marts.
   b. It takes less time and cost to deploy than other approaches.
   c. It enforces consistency and standardization of data across all data marts.
   d. None of the above

2. The main objective(s) of data warehouse design is/are
   a. Efficient query processing
   b. Efficient transaction processing
   c. None

3. In an independent data mart, the data and dimensions (tables) are shared across multiple data marts.
   True / False

4. The ODS provides the provision for current data.
   True / False

5. The data access complexity is higher in dependent data marts.
   True / False

Answers: 1. b  2. None  3. False  4. True  5. True

Page 38: data warhouse introduction


Module – 3

ETL Process

Page 39: data warhouse introduction


Building a Data Warehouse

Topics to be covered:

Extracting
 – Extracting Data
 – Extraction Techniques
 – Extraction Tools

Transforming
 – Importance of Quality Data
 – Characteristics of Quality Data
 – Data Quality Assurance
 – Data Quality Tools
 – Transforming Data: Problems and Solutions
 – Transforming Techniques
 – Transformation Tools

Transporting Data (Loading)

Page 40: data warhouse introduction


Extracting Data

• Extraction is the process of getting data from a legacy system or any other data source.

• After extraction, the data is put in a staging area where it can be scrubbed and cleaned.

• The data may come from a single source or from multiple sources.

• If the data comes from a single source, it can come from an OLTP system or from a flat file.

• If the data comes from multiple sources, a connector tool is required to connect to them.

Page 41: data warhouse introduction


Extracting Data – Methods of Extraction

• Extraction can be done either with hand-coded programs or with tools.

• Examine the source data and identify a suitable extraction tool.

• Extracts are typically written in source system code (e.g. PL/SQL, VB Script or COBOL).

• An extraction tool also generates source system code.

• Using a tool makes the extraction process easier than hand-coding.

• Pre- and post-processes exist. E.g. before the extract process there may be a call to sort the data, or a call to a function that scores a record based on a formula.

Page 42: data warhouse introduction


Extracting Data – Methods of Extraction

• Tools follow a well-defined, disciplined and documented approach.

• Tools make extraction easier by providing click, drag and drop features.

• Hand-coded extraction can be cost effective, since the PL/SQL constructs are available with the RDBMS.

• Hand-coded extraction is used when the programmer has a clear idea of the data structures.

• Both custom-programmed extraction (PL/SQL scripts) and tool-based extraction have advantages and disadvantages.

Page 43: data warhouse introduction


Extraction Techniques

Bulk Extraction

• The entire data warehouse is refreshed periodically by extractions from the source systems.

• All applicable data are extracted from the source systems for loading into the warehouse.

• This approach makes heavy use of the network connection for loading data from source to target databases, but the mechanism is easy to set up and maintain.

Page 44: data warhouse introduction


Extraction Techniques

Change-Based Replication

• Only data that have been newly inserted or updated in the source systems are extracted and loaded into the warehouse.

• This approach puts less load on the network connection because a smaller volume of data is transported.

• This mechanism involves complex programming to determine when a new warehouse record must be inserted and when an existing warehouse record must be updated.
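The difference between the two extraction techniques can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the source rows, the `updated_at` change-tracking column and the helper names are all assumptions, and real systems may detect changes through logs or triggers instead of a timestamp.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
SOURCE_ROWS = [
    {"id": 1, "name": "Pen", "updated_at": datetime(2019, 8, 1)},
    {"id": 2, "name": "Ink", "updated_at": datetime(2019, 8, 5)},
    {"id": 3, "name": "Pad", "updated_at": datetime(2019, 8, 9)},
]

def bulk_extract(rows):
    """Bulk extraction: return every applicable row on each run."""
    return list(rows)

def change_based_extract(rows, last_run):
    """Change-based extraction: return only rows changed after last_run."""
    return [row for row in rows if row["updated_at"] > last_run]

def apply_to_warehouse(warehouse, extracted):
    """Insert new records and update existing ones, keyed by id."""
    inserted, updated = 0, 0
    for row in extracted:
        if row["id"] in warehouse:
            updated += 1
        else:
            inserted += 1
        warehouse[row["id"]] = dict(row)
    return inserted, updated

warehouse = {}
first_load = apply_to_warehouse(warehouse, bulk_extract(SOURCE_ROWS))
changes = change_based_extract(SOURCE_ROWS, last_run=datetime(2019, 8, 4))
second_load = apply_to_warehouse(warehouse, changes)
print(first_load, [row["id"] for row in changes], second_load)
```

The bulk run moves all three rows, while the change-based run moves only the two rows modified after the last run, at the cost of the extra insert-or-update decision in `apply_to_warehouse`.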


Page 45: data warhouse introduction


Page 46: data warhouse introduction


Extraction Tools

Criteria for Identifying an Extraction Tool

The Source System Platform and Database

• No tool can access all types of data sources on all types of computing platforms.

Built-in Extraction or Duplication Functionality

• The availability of built-in extraction or duplication functionality reduces the technical difficulties inherent in the data extraction process.

Page 47: data warhouse introduction


Extraction Tools

Extraction tools include:

Apertus Carleton. Passport
 – Users enter extraction and transformation parameters, and data is filtered against domains and ranges of legal values.

Evolutionary Technologies. ETL Extract.
 – Users write transformation rules. Data is filtered against domains and ranges of legal values and compared to other data structures.

Platinum. InfoPump
 – A data pump product designed to extract data from several mainframe and client/server platforms, perform some filtering and transformation, and distribute and load to another mainframe platform database.
 – Requires InfoHub for most services.
 – The extraction uses custom code modules. This is a client/server based tool.

Page 48: data warhouse introduction


Transformation Phase

Importance of Quality Data

Creating Business Rules

Tools are available to create Reusable Transformation Modules or objects

Simple Data Transformation, which includes Date, Number and Character Conversion

 Assigning Surrogate Keys

Combining from Separate Sources

Validating one to one and one to many Relationships
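One of the steps listed above, assigning surrogate keys, can be illustrated with a small sketch. The class name, the source system names and the key format are hypothetical; the idea shown is only that the warehouse issues its own integer key for each (source system, natural key) pair instead of reusing source identifiers.

```python
import itertools

class SurrogateKeyGenerator:
    """Map (source system, natural key) pairs to warehouse integer keys."""

    def __init__(self):
        self._next = itertools.count(1)
        self._keys = {}

    def key_for(self, source_system, natural_key):
        pair = (source_system, natural_key)
        if pair not in self._keys:
            # First time this source record is seen: issue a new key.
            self._keys[pair] = next(self._next)
        return self._keys[pair]

keys = SurrogateKeyGenerator()
billing_key = keys.key_for("billing", "CUST-001")
crm_key = keys.key_for("crm", "CUST-001")
repeat_key = keys.key_for("billing", "CUST-001")
print(billing_key, crm_key, repeat_key)
```

The same natural key arriving from two systems gets two different warehouse keys, while a repeated lookup returns the key issued earlier.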

Page 49: data warhouse introduction


Transforming Data

Importance of Quality data.

Transformation

Transforming data : Problems and Solutions

Transformation Techniques

Transformation Tools


Page 50: data warhouse introduction


Importance of Quality Data

Quality Data:

• Before the extracted data is transformed, its quality has to be checked.

• Once quality data is transformed, there is minimal need to change the data at the target, which reduces inconsistencies between source and target.

• Data quality management is an approach to enterprise information that ensures your data is consistent, accurate and reliable.

Page 51: data warhouse introduction


Data Quality Assurance

Characteristics of Quality Data

 Accurate

Complete

Consistent

Unique

Timely


Page 52: data warhouse introduction


Data Quality Assurance

• Data quality tools assist warehousing teams with the task of locating and correcting data errors.

• Corrections can be made to the source or to the target. However, corrections made only to the target cause inconsistencies between the source and target data, which create synchronization problems.

Page 53: data warhouse introduction


Data Quality Tools

• Though dirty data continues to be one of the biggest issues for data warehousing initiatives, research indicates that data quality investments are a small percentage of total warehouse spending.

• These are some of the data quality tools available:

 – DataFlux. Data Quality Workbench.
 – Pine Cone Systems. Content Tracker.
 – Prism. Quality Manager.
 – Vality Technology. Integrity Data Reengineering

Page 54: data warhouse introduction


Transformation

Transformation:

• Transformation is the process by which extracted data are converted into the appropriate format.

• The extracted data is put into the staging area, where it is cleaned, scrubbed and stored so that the clean data can be transformed.

• For the transformation phase, data can come from a cleansing tool.

• After transformation, data goes to the transportation stage.

Page 55: data warhouse introduction


Transforming Data – Problems

The common problems of data that come out of a legacy system are:

• Inconsistent or incorrect use of codes and special characters.
• A single field is used for unofficial or undocumented purposes.
• Overloaded codes.
• Evolving data.
• Missing, incorrect or duplicate values.

Page 56: data warhouse introduction


Transforming Data – some Solutions

There are different solutions available to check whether the data to be loaded is correct or not:

Cross-Footing
 – A template of the data quality norms can be used to identify erroneous data by comparing the data with the norms in the template.

Manual Examination
 – A sampling methodology can be selected and a manual examination can be made on the sampled data.

Process Validation
 – Scripts can be generated that identify erroneous records and segregate them.
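A process validation script of the kind described above might look like the following sketch. The two rules (a non-empty `customer_id` and a non-negative numeric `amount`) and the field names are assumptions chosen for illustration; an actual script would encode the project's own data quality norms.

```python
def validate(record):
    """Return a list of rule violations for one record (empty if clean)."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("invalid amount")
    return errors

def segregate(records):
    """Split records into clean rows and rejected rows with their reasons."""
    clean, rejected = [], []
    for record in records:
        errors = validate(record)
        if errors:
            rejected.append((record, errors))
        else:
            clean.append(record)
    return clean, rejected

rows = [
    {"customer_id": "C1", "amount": 25.0},
    {"customer_id": "", "amount": 10.0},
    {"customer_id": "C3", "amount": -5},
]
clean, rejected = segregate(rows)
print(len(clean), [errors for _, errors in rejected])
```

Keeping the rejected rows together with their reasons, rather than silently dropping them, lets the erroneous data be reviewed and corrected later.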


Page 57: data warhouse introduction


Transformation Techniques

Field Splitting and Consolidation:

• A single physical field in the source system needs to be split up into more than one target warehouse field.

• Several source system fields must be consolidated and stored in one single warehouse field.

Example: splitting an address field

Source field:
# 123 ABC Street,
DEF City,
Republic of GH

Target fields:
No : 123
Street : ABC STREET
City : DEF
Country : GH
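The address example can be reproduced with a short Python sketch. The parsing rules below are assumptions that fit only this three-part layout (street line, city line, country line); real addresses vary far more. The `consolidate_name` helper is a hypothetical illustration of the opposite direction, consolidation.

```python
import re

def split_address(address):
    """Split a three-part address into separate warehouse fields."""
    street_part, city_part, country_part = [
        part.strip() for part in address.split(",")
    ]
    match = re.match(r"#?\s*(\d+)\s+(.*)", street_part)
    if match is None:
        raise ValueError(f"unrecognized street format: {street_part!r}")
    return {
        "no": match.group(1),
        "street": match.group(2).upper(),
        "city": re.sub(r"\s+City$", "", city_part),
        "country": re.sub(r"^Republic of\s+", "", country_part),
    }

def consolidate_name(first, last):
    """Consolidate two source fields into one warehouse field."""
    return f"{first} {last}".strip()

fields = split_address("# 123 ABC Street, DEF City, Republic of GH")
full_name = consolidate_name("John", "Istin")
print(fields, full_name)
```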

Page 58: data warhouse introduction


Transformation Techniques

Standardization:

• Standards and conventions for abbreviations are applied to individual data items to improve uniformity in both source and target objects.

Before:
System A Order Date: 05 August 2007
System B Order Date: 08-08-07

After:
System A Order Date: August 05 2007
System B Order Date: August 08 2007
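The date example can be sketched with Python's datetime module. The source formats are assumptions read off the two sample values; in particular, "08-08-07" does not reveal whether System B writes day-month or month-day, so day-month-year is assumed here. Month names follow the default (English) locale.

```python
from datetime import datetime

# Assumed source formats, inferred from the sample values only.
SOURCE_FORMATS = {
    "A": "%d %B %Y",   # e.g. "05 August 2007"
    "B": "%d-%m-%y",   # e.g. "08-08-07" (day-month-year is an assumption)
}
TARGET_FORMAT = "%B %d %Y"  # e.g. "August 05 2007"

def standardize_order_date(system, raw):
    """Parse a source-specific date string and emit the standard format."""
    parsed = datetime.strptime(raw, SOURCE_FORMATS[system])
    return parsed.strftime(TARGET_FORMAT)

date_a = standardize_order_date("A", "05 August 2007")
date_b = standardize_order_date("B", "08-08-07")
print(date_a, "|", date_b)
```

Keeping one format string per source system makes the convention explicit, so adding another source only means adding one more entry to the mapping.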

Page 59: data warhouse introduction


Transformation Techniques

De-duplication:

• Rules are defined to identify duplicate records of customers or products. When two or more records refer to the same entity, they are merged to form one warehouse record.

Before:
System A Customer Name: John W Istin
System B Customer Name: John William Istin

After:
Customer Name: John William Istin
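A de-duplication rule for the customer name example could be sketched as follows. The matching rule is an assumption made for illustration: two names match when the first and last names agree and one middle name is a prefix of the other. Production matching usually relies on more attributes than the name.

```python
def same_person(name_a, name_b):
    """Rule: same first and last name, and middle names agree or abbreviate."""
    a, b = name_a.lower().split(), name_b.lower().split()
    if len(a) < 2 or len(b) < 2 or a[0] != b[0] or a[-1] != b[-1]:
        return False
    mid_a, mid_b = " ".join(a[1:-1]), " ".join(b[1:-1])
    return mid_a.startswith(mid_b) or mid_b.startswith(mid_a)

def deduplicate(names):
    """Merge matching names, keeping the most complete (longest) spelling."""
    merged = []
    for name in names:
        for i, kept in enumerate(merged):
            if same_person(name, kept):
                merged[i] = max(name, kept, key=len)
                break
        else:
            # No existing record matched, so this is a new warehouse record.
            merged.append(name)
    return merged

result = deduplicate(["John W Istin", "John William Istin", "Jane Doe"])
print(result)
```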

Page 60: data warhouse introduction


Transformation Tools

Some of the transformation tools include:

• Apertus Carleton. Enterprise/Integrator.
• Data Mirror. Transformation Server.
• Informatica. PowerMart Designer.

Page 61: data warhouse introduction


Building DWH in Refresh Phase

Process Slowly Changing Dimensions

 Automate the Extract-Transform-Load Cycle.

Incremental Fact Table Extracts.

Purging and Archiving Data.

Page 62: data warhouse introduction


Steps in Data Reconciliation at Building DWH – EXTRACT

Capture = extract… obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

• Static extract = capturing a snapshot of the source data at a point in time
• Incremental extract = capturing changes that have occurred since the last static extract

Page 63: data warhouse introduction


Steps in Data Reconciliation – TRANSFORM

Scrub = cleanse… uses pattern recognition and AI techniques to upgrade data quality

• Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Page 64: data warhouse introduction


Steps in Data Reconciliation – TRANSFORM

Transform = convert data from format of operational system to format of data warehouse

Record-level:

Selection – data partitioning

Joining – data combining

Aggregation – data summarization

Field-level:

single-field – from one field to one field

multi-field – from many fields to one, or one field to many
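The record-level and field-level functions can be illustrated on a tiny in-memory data set. All table contents and field names below (`orders`, `customers`, `amount_cents`, `full_name`) are hypothetical and chosen only to show one instance of each function.

```python
from collections import defaultdict

orders = [
    {"order_id": 1, "cust_id": 10, "region": "EU", "amount": 40.0},
    {"order_id": 2, "cust_id": 11, "region": "US", "amount": 60.0},
    {"order_id": 3, "cust_id": 10, "region": "EU", "amount": 25.0},
]
customers = {10: {"first": "Ann", "last": "Lee"}, 11: {"first": "Raj", "last": "Rao"}}

# Record-level selection: partition the data by a condition.
eu_orders = [o for o in orders if o["region"] == "EU"]

# Record-level joining: combine each order with its customer record.
joined = [{**o, **customers[o["cust_id"]]} for o in eu_orders]

# Field-level, single-field: one source field to one target field.
for row in joined:
    row["amount_cents"] = int(round(row["amount"] * 100))

# Field-level, multi-field: many source fields to one target field.
for row in joined:
    row["full_name"] = f'{row["first"]} {row["last"]}'

# Record-level aggregation: summarize amounts per customer.
totals = defaultdict(float)
for row in joined:
    totals[row["full_name"]] += row["amount"]

print(dict(totals))
```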


Steps in Data Reconciliation – TRANSPORT (LOAD)

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse
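The two load modes can be contrasted with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the `dw_customer` table, its columns, the index name and both function names are invented for the example, and SQLite stands in for whatever warehouse RDBMS is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_customer (id INTEGER PRIMARY KEY, city TEXT)")


def refresh_load(conn, rows):
    """Refresh mode: rewrite the whole target table, then (re)create its index."""
    conn.execute("DELETE FROM dw_customer")
    conn.executemany("INSERT INTO dw_customer (id, city) VALUES (?, ?)", rows)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_dw_customer_city ON dw_customer (city)"
    )
    conn.commit()


def update_load(conn, changed_rows):
    """Update mode: write only changed rows, replacing rows with the same key."""
    conn.executemany(
        "INSERT OR REPLACE INTO dw_customer (id, city) VALUES (?, ?)", changed_rows
    )
    conn.commit()


refresh_load(conn, [(1, "Pune"), (2, "Hyderabad")])
update_load(conn, [(2, "Chennai"), (3, "Mumbai")])
result = conn.execute("SELECT id, city FROM dw_customer ORDER BY id").fetchall()
print(result)
```

Refresh mode is simpler but rewrites everything; update mode touches fewer rows but needs a reliable way to identify what changed in the source.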


Transporting Phase

 

Insert statements create Logs

 Bulk Loader is advisable

 Truncate target tables before full refresh

 Index Management

 Drop and re-index.
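The advice on this slide (empty the target before a full refresh, load in bulk, drop the index first and rebuild it afterwards) might look as follows. It is a sketch only: the `sales` table, the index name and the `full_refresh` function are assumptions, and SQLite has no TRUNCATE statement, so an unqualified DELETE is used in its place.

```python
import sqlite3


def full_refresh(conn, rows):
    """Full refresh: drop the index, empty the target, bulk insert, re-index."""
    with conn:  # one transaction for the whole load
        conn.execute("DROP INDEX IF EXISTS idx_sales_item")
        conn.execute("DELETE FROM sales")  # stands in for TRUNCATE
        conn.executemany("INSERT INTO sales (item, units) VALUES (?, ?)", rows)
        conn.execute("CREATE INDEX idx_sales_item ON sales (item)")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, units INTEGER)")
full_refresh(conn, [("pen", 5), ("ink", 2)])
full_refresh(conn, [("pen", 7), ("ink", 1), ("pad", 4)])
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)
```

Building the index once after the load is usually cheaper than maintaining it row by row during the load, which is the reasoning behind "drop and re-index".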


Transporting Data

Transporting data into Warehouse

Building the Transportation Process

Transporting the data

Post processing of loaded data


Transporting Data into Warehouse

• The transformed data is transported into the data warehouse.

• The load images are transported through the loaders into the warehouse.

Data Loaders :

• Data loaders load transformed data into the data warehouse.

• Stored procedures can be used to handle the warehouse loading if the load images are available in the same RDBMS engine.


Transporting Data into Warehouse

[Diagram: Source Data --> (extract) --> Staging Area --> (load) --> Warehouse Schema]


Transporting Data into Warehouse

Warehouse Schema:

• It is the dimensional model (dimensions and facts).

Staging Area:

• It is the workspace where data is held ready after cleaning. This minimizes the time required to prepare the data.

Source Data:

• This can be a flat file, an Oracle table or some other form.


Building the Transporting Process

For transporting data we can use:

PL/SQL scripts

SQL*Loader routines for flat files

ETL tool

SQL*Loader can put data from flat files directly into tables. We use it for bulk loading.

SQL*Loader can load both varying-length and fixed-format files.
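To make the distinction between varying-length and fixed-format flat files concrete, the sketch below parses the same two records in both layouts. It uses plain Python, not the Oracle loader itself; the sample records and the 3 + 10 + 10 character layout are assumptions made for the example.

```python
import csv
import io

# Varying-length (delimited) records: fields are split on a separator.
delimited = io.StringIO("101,Asha,Pune\n102,Vikram,Hyderabad\n")
delimited_rows = [tuple(fields) for fields in csv.reader(delimited)]

# Fixed-format records: each field occupies fixed character positions.
# Assumed layout: id in columns 0-2, name in 3-12, city in 13-22.
LAYOUT = [("id", 0, 3), ("name", 3, 13), ("city", 13, 23)]
fixed_text = "".join(f"{i:<3}{n:<10}{c:<10}\n" for i, n, c in delimited_rows)

fixed_rows = [
    tuple(line[start:end].strip() for _, start, end in LAYOUT)
    for line in fixed_text.splitlines()
]

print(delimited_rows == fixed_rows)
```

A bulk loader is told the same thing declaratively: either a field separator or the start and end position of each field.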


Data Warehouse Building

[Diagram: Abstract view of data warehouse building. Labels: Source A, Source B, Source C (operational), each holding Part A, Part B and Part C; Extraction; Transformation; Categorization of transaction data; A, B, C (analytical); User's view.]


ETVL Tools

The following are popular ETL tools:

Oracle Warehouse Builder

Informatica

Sagent

SAS Warehouse Administrator


LEAVING A METADATA TRAIL

Defining Warehouse Metadata

Developing a Metadata Strategy

Examining types of Metadata

Metadata Management Tools

Common Warehouse Metadata


Metadata

What is Metadata?

Traditionally defined as data about data

Form of abstraction that describes the structure and contents of the data warehouse
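As a small illustration of "data about data", most databases can describe their own tables. The sketch below uses SQLite's `PRAGMA table_info` through Python's `sqlite3` module; the `sales_fact` table is a made-up example, and a real warehouse would normally keep much richer metadata (sources, transformation rules, load history) in a dedicated repository.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (time_key INTEGER, item_key INTEGER, units_sold INTEGER)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(sales_fact)").fetchall()
metadata = {name: col_type for _, name, col_type, *_ in columns}
print(metadata)
```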


Importance of Metadata

Metadata establish the context of the Warehouse data

Metadata facilitate the Analysis Process

Metadata are a form of Audit Trail for Data Transformation

Metadata Improve or Maintain Data Quality


Quick Recap

ETL Process

• Extracting Data: the process of getting data from a legacy system or any data source.

• Transforming Data: the process in which extracted data are transformed into an appropriate format.

• Loading Data: the transformed data is loaded into the warehouse.

Building the Data Warehouse using

• Extracting techniques

• Transforming techniques

• Transporting techniques

• ETVL Tools

Meta Data & Importance

• Meta Data is data about data and is important for data transformation.


QUIZ

1. Which of the following is not an OLAP tool? Oracle Express / OWB / Cognos / Microstrategy

Answer: OWB

2. Which of the following should be the goals of the ETL application development process? Modular and re-usable code / Self-documenting process flows / Fully metadata-aware process / All of the above

Answer: All of the above

3. The information about the data, i.e. data about the data, is kept in: RDBMS / DBMS / Metadata

Answer: Metadata

4. Hand-coded extraction techniques allow extraction in a cost-effective manner. True / False

Answer: True

5. How do you handle slowly changing dimensions? Manually handle them / Using a data staging tool / Both of the above

Answer: Both


Module – 4

Data Warehouse vs Data Mart


Topics to be covered

This Module provides

What is a Data Mart

Data Mart-Approaches

Top-Down Approach

Bottom-Up Approach

Hybrid Approach

Conceptual Modeling of Data Warehouse with Examples

Star Schema

Snow Flake Schema

Fact Constellations


Data Mart

Data mart is:

A functional segment of an enterprise restricted for purposes of security, locality, performance, or business necessity using modeling and information delivery techniques identical to data warehousing.


Data Mart - Approaches

Physical data warehouse:

Data warehouse --> data marts

Data marts --> data warehouse

Parallel data warehouse and data marts


Top-down

[Diagram: Source Data (External Data, Operational Data) --> Staging Area --> Data Warehouse --> Data Marts]

Physical Data Warehouse: Data Warehouse --> Data Marts


Bottom-up approach

[Diagram: Source Data (External Data, Operational Data) --> Staging Area --> Data Marts --> Data Warehouse]

Physical Data Warehouse: Data Marts --> Data Warehouse


Hybrid

[Diagram: Source Data (External Data, Operational Data) --> Staging Area --> Data Warehouse and Data Marts, built in parallel]

Physical Data Warehouse: Parallel Data Warehouse & Data Marts


Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures

Star schema

Snowflake schema

Fact constellations 


Example of Star Schema

Sales Fact Table

Keys: time_key, item_key, branch_key, location_key

Measures: units_sold, dollars_sold, avg_sales

Dimension tables

time: time_key, day, day_of_the_week, month, quarter, year

item: item_key, item_name, brand, type, supplier_type

branch: branch_key, branch_name, branch_type

location: location_key, street, city, province_or_state, country
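A cut-down version of this star schema can be built and queried with Python's `sqlite3` module. The sketch keeps only two dimensions and a few columns; the `dim_` table prefix and all sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES dim_time (time_key),
        item_key INTEGER REFERENCES dim_item (item_key),
        units_sold INTEGER,
        dollars_sold REAL
    );
    INSERT INTO dim_time VALUES (1, 2019, 'Q1'), (2, 2019, 'Q2');
    INSERT INTO dim_item VALUES (1, 'Pen', 'Acme'), (2, 'Ink', 'Acme');
    INSERT INTO sales_fact VALUES (1, 1, 10, 50.0), (1, 2, 4, 40.0), (2, 1, 6, 30.0);
    """
)

# Each dimension is one join away from the fact table.
rows = conn.execute(
    """
    SELECT t.quarter, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN dim_time AS t ON t.time_key = f.time_key
    GROUP BY t.quarter
    ORDER BY t.quarter
    """
).fetchall()
print(rows)
```

The fact table holds only keys and measures; descriptive attributes such as the quarter live in the dimension tables and are reached with a single join.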


Example of Snowflake Schema

Sales Fact Table

Keys: time_key, item_key, branch_key, location_key

Measures: units_sold, dollars_sold, avg_sales

Dimension tables

time: time_key, day, day_of_the_week, month, quarter, year

item: item_key, item_name, brand, type, supplier_key

supplier: supplier_key, supplier_type

branch: branch_key, branch_name, branch_type

location: location_key, street, city_key

city: city_key, city, province_or_state, country
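The effect of normalizing a dimension can be seen in a query: reaching the country now takes two joins, through location and then city. The sketch below again uses `sqlite3`; the `dim_` prefix, the reduced column lists and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE dim_location (
        location_key INTEGER PRIMARY KEY,
        street TEXT,
        city_key INTEGER REFERENCES dim_city (city_key)
    );
    CREATE TABLE sales_fact (location_key INTEGER, units_sold INTEGER);
    INSERT INTO dim_city VALUES (1, 'Pune', 'India'), (2, 'Leeds', 'UK');
    INSERT INTO dim_location VALUES
        (1, 'MG Road', 1), (2, 'FC Road', 1), (3, 'Park Row', 2);
    INSERT INTO sales_fact VALUES (1, 5), (2, 7), (3, 3);
    """
)

# The city attributes are stored once per city, not once per location.
rows = conn.execute(
    """
    SELECT c.country, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN dim_location AS l ON l.location_key = f.location_key
    JOIN dim_city AS c ON c.city_key = l.city_key
    GROUP BY c.country
    ORDER BY c.country
    """
).fetchall()
print(rows)
```

The trade-off is less redundancy in the dimension data in exchange for more joins at query time.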


Example of Fact Constellation

Sales Fact Table

Keys: time_key, item_key, branch_key, location_key

Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table

Keys: time_key, item_key, shipper_key, from_location, to_location

Measures: dollars_cost, units_shipped

Dimension tables

time: time_key, day, day_of_the_week, month, quarter, year

item: item_key, item_name, brand, type, supplier_type

branch: branch_key, branch_name, branch_type

location: location_key, street, city, province_or_state, country

shipper: shipper_key, shipper_name, location_key, shipper_type
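In a fact constellation, two fact tables can be compared through a dimension they share. The sketch below keeps only the item dimension and one measure per fact table; the `dim_item` name and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE sales_fact (item_key INTEGER, units_sold INTEGER);
    CREATE TABLE shipping_fact (item_key INTEGER, units_shipped INTEGER);
    INSERT INTO dim_item VALUES (1, 'Pen'), (2, 'Ink');
    INSERT INTO sales_fact VALUES (1, 10), (2, 4), (1, 6);
    INSERT INTO shipping_fact VALUES (1, 12), (2, 4);
    """
)

# Aggregate each fact table separately, then join through the shared dimension,
# so rows from one fact table do not multiply rows from the other.
rows = conn.execute(
    """
    SELECT i.item_name, s.sold, h.shipped
    FROM dim_item AS i
    JOIN (SELECT item_key, SUM(units_sold) AS sold
          FROM sales_fact GROUP BY item_key) AS s ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(units_shipped) AS shipped
          FROM shipping_fact GROUP BY item_key) AS h ON h.item_key = i.item_key
    ORDER BY i.item_name
    """
).fetchall()
print(rows)
```

Joining the two fact tables to each other directly would repeat rows and inflate the sums, which is why each is summarized before the join.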


Quick Recap

In this module we have seen the following topics.

• What is a Data Mart? The subset of the data warehouse related to a single subject area.

• Various approaches to build the Data Mart

– Top-Down Approach: Data Warehouse to Data Marts

– Bottom-Up Approach: Data Marts to Data Warehouse

– Hybrid Approach: Data Warehouse and Data Marts are built in parallel

• Conceptual Modeling Using

– Star Schema: single fact table surrounded by multiple dimensions

– Snow Flake Schema: single fact table surrounded by normalized dimensions

– Fact Constellations: one or more fact tables surrounded by dimensions

• Examples of the Modeling Techniques


QUIZ

1. Which type of Data Warehouse schema normalizes dimensions to eliminate redundancy? Star Schema / Snowflake Schema

Answer: Snow Flake Schema

2. Data marts always have multiple subject areas. True / False

Answer: False

3. In a fact constellation, there are many fact tables sharing the same dimension tables. False / True

Answer: True

4. An Enterprise Warehouse can be built by combining the Data Marts. False / True

Answer: True

5. Which of the following are the entry points in the Warehouse? Fact tables / Dimension tables

Answer: Dimension


mahindrasatyam.com

Safe Harbor

This document contains forward-looking statements within the meaning of section 27A of Securities Act of 1933, as amended, and section 21E of the Securities Exchange Act of 1934, as amended. The forward-looking statements contained herein are subject to certain risks and uncertainties that could cause actual results to differ materially from those reflected in the forward-looking statements. We undertake no duty to update any forward-looking statements. For a discussion of the risks associated with our business, please see the discussions under the heading “Risk Factors” in our report on Form 6-K concerning the quarter ended September 30, 2008, furnished to the Securities and Exchange Commission on 07 November, 2008, and the other reports filed with the Securities and Exchange Commission from time to time. These filings are available at http://www.sec.gov

Thank you