Data Warehousing Seminar Chapter 5. Data Warehouse Design Methodology. Data Warehousing Lab. HyeYoung Cho

Upload: claud-godfrey-malone

Post on 27-Dec-2015


Page 1: Data Warehousing Seminar Chapter 5. Data Warehouse Design Methodology Data Warehousing Lab. HyeYoung Cho

Data Warehousing Seminar
Chapter 5. Data Warehouse Design Methodology

Data Warehousing Lab.

HyeYoung Cho

Page 2

Index

- The Information Utility's Infrastructure
- The Preferred Architecture: Integration Layer and High Performance Query Structures
  - Data Store 1 - The Source Systems
  - Data Flow 1 - From the Data Sources to the Integration Layer
  - Data Store 2 - The Integration Layer
  - Data Flow 2 - From the Integration Layer to the High Performance Query Structures
  - Data Store 3 - High Performance Query Structures (HPQS)
  - Data Flow 3 - From the High Performance Query Structures to the End User Reporting Applications
  - Data Store 4 - Data in the End User's Hands
- Alternate Warehousing Architectures

Page 3

The Information Utility's Infrastructure

The warehouse must:
- extract data from a variety of sources
- integrate data into a common repository
- put data into a format that users can use
- provide users with tools to access the warehouse

Page 4

The Preferred Architecture: Integration Layer and High Performance Query Structures

4 data stores and 3 data flows.

Page 5

Data Store 1 - The Source Systems

Source systems provide data to the warehouse:
- enterprise resource planning (ERP) packages
  - SAP, PeopleSoft, Oracle applications
- home-grown applications
  - OASIS system
- outside sources
  - data purchased from outside vendors

Diagram: source systems (sales, accounting, distribution, etc.) -> warehouse data

Page 6

Data Flow 1 - From the Data Sources to the Integration Layer

Data extraction step
- gets data out of its sources
- extraction happens at the beginning of every data flow
- a very complex step
  - sources use a variety of data storage technologies, e.g. Oracle, DB2, Informix, IMS, and other formats
  -> each source requires its own select statements and extraction code
- considerations for extraction

Page 7

Data Flow 1 - From the Data Sources to the Integration Layer

Is This Extract Supporting the Initial Load of the Warehouse or a Periodic Refresh Load?
- problems with complete refreshes
  - the warehouse is a record of history!
    -> history is frequently lost by the source systems
  - warehouses tend to be very large!
    -> a complete refresh makes poor use of computing and telecommunications bandwidth
- two architectures to load the warehouse
  - initial load: history data from offline storage plus online data; bring it all over
  - periodic refresh: changed source records only; use special logic for timestamps
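
The difference between the two load styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `updated_at` column, the sample rows, and both function names are hypothetical and do not come from the slides.

```python
from datetime import datetime

# Hypothetical source rows; `updated_at` is an assumed last-modified column.
SOURCE_ROWS = [
    {"id": 1, "name": "Acme", "updated_at": datetime(2015, 1, 5)},
    {"id": 2, "name": "Bolt", "updated_at": datetime(2015, 3, 9)},
    {"id": 3, "name": "Crux", "updated_at": datetime(2015, 3, 20)},
]


def initial_load(rows):
    """Initial load: bring every available record over."""
    return list(rows)


def periodic_refresh(rows, last_load_time):
    """Periodic refresh: extract only rows changed since the previous load."""
    return [r for r in rows if r["updated_at"] > last_load_time]


everything = initial_load(SOURCE_ROWS)
changed = periodic_refresh(SOURCE_ROWS, datetime(2015, 3, 1))
print(len(everything), [r["id"] for r in changed])  # prints: 3 [2, 3]
```

The refresh moves two rows instead of three here; on a large warehouse that ratio is what saves bandwidth.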

 

Page 8

Data Flow 1 - From the Data Sources to the Integration Layer

How Will I Determine What Records to Extract?
- change data capture
  - what source records have changed
  - how those records are moved to the warehouse
- the delete question!
  - no trace: the deleted record is just gone!
- techniques for recognizing changes
  - Timestamps
    - records are stamped whenever they are inserted or changed
    - narrows the search for records that have changed
  - Triggers
    - put a trigger on the source tables
    - write a corresponding (insert, update, delete) message in a log file
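
A minimal sketch of the trigger technique, using Python's built-in `sqlite3` module. Assumptions: the `customer` and `change_log` tables and the trigger names are hypothetical, and the messages go to a log table rather than the log file the slide mentions. Note that the delete trigger records the delete that would otherwise leave no trace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- Hypothetical log table standing in for the log file.
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT,
    customer_id INTEGER
);

CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN
    INSERT INTO change_log (op, customer_id) VALUES ('insert', NEW.id);
END;

CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO change_log (op, customer_id) VALUES ('update', NEW.id);
END;

CREATE TRIGGER customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO change_log (op, customer_id) VALUES ('delete', OLD.id);
END;
""")

# Ordinary source-system activity; the triggers write the log as a side effect.
conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.execute("UPDATE customer SET name = 'Acme Ltd' WHERE id = 1")
conn.execute("DELETE FROM customer WHERE id = 1")

log = conn.execute(
    "SELECT op, customer_id FROM change_log ORDER BY seq"
).fetchall()
print(log)  # [('insert', 1), ('update', 1), ('delete', 1)]
```

An extract program would then read `change_log` instead of scanning the whole source table.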

Page 9

Data Flow 1 - From the Data Sources to the Integration Layer

- Application Integration Software (AIS)
  - e.g. MQ Series, Mercator, Tibco
  - links applications: when a transaction occurs in one, it is transmitted to all the others
  - captures all transactions in AIS-enabled systems
  - gives real-time access to data
- File Compares
  - compare today’s file to the last loaded file
  - difficult to implement and less accurate
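
A file compare can be sketched as a key-by-key diff of two snapshots. This is a simplified illustration: it assumes each file has already been parsed into a dictionary keyed by a unique record key, and the function name and sample data are hypothetical.

```python
def compare_snapshots(previous, current):
    """Classify keys as inserted, updated, or deleted between two snapshots."""
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return {"insert": inserted, "update": updated, "delete": deleted}


# Hypothetical snapshots: record key -> record contents.
last_loaded = {101: ("Acme", "NY"), 102: ("Bolt", "LA"), 103: ("Crux", "SF")}
today = {101: ("Acme", "NY"), 102: ("Bolt", "TX"), 104: ("Dyne", "WA")}

changes = compare_snapshots(last_loaded, today)
print(changes)  # {'insert': [104], 'update': [102], 'delete': [103]}
```

The sketch also hints at the accuracy problem: a record that changed and then changed back between two snapshots is never seen.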

Page 10

Data Flow 1 - From the Data Sources to the Integration Layer

How Will I Format the Extracted Records?
- store each extracted record together with information that identifies it
  - what source system generated the record
  - when the record was obtained
  - the key of the record

What Will I Do with the Extracted Records?
- data loading programs
  - read flat files and load the data into the warehouse
- "loosely coupled" warehousing architectures
  - separate extract programs and load programs
  -> more flexible and maintainable warehouse!
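
The record format and the loosely coupled split can be sketched together. Assumptions: the CSV layout, the field names, and both functions are hypothetical, and an in-memory buffer stands in for the flat file. The two programs share nothing except the file format.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical flat-file layout: each record carries its origin, time, and key.
FIELDS = ["source_system", "extracted_at", "record_key", "payload"]


def extract_to_flat_file(rows, source_system, out):
    """Extract program: write tagged records to a flat file."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    stamp = datetime.now(timezone.utc).isoformat()
    for key, payload in rows:
        writer.writerow({
            "source_system": source_system,
            "extracted_at": stamp,
            "record_key": key,
            "payload": payload,
        })


def load_from_flat_file(inp, warehouse):
    """Load program: read the flat file and load records into the warehouse."""
    for rec in csv.DictReader(inp):
        warehouse[(rec["source_system"], rec["record_key"])] = rec["payload"]


flat_file = io.StringIO()  # stands in for a file on disk
extract_to_flat_file([("C-1", "Acme"), ("C-2", "Bolt")], "sales", flat_file)

flat_file.seek(0)
warehouse = {}
load_from_flat_file(flat_file, warehouse)
print(warehouse)
```

Because the loader only depends on the file layout, an extract program can be rewritten for a new source technology without touching the load side.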

Page 11

Data Flow 1 - From the Data Sources to the Integration Layer

A Few Notes About Dirty Data
- data can be dirty in several ways
  - Format violations
  - Referential integrity violations
  - Cross-system matching violations
  - Internal consistency violations
- dirty data
  - makes the warehouse unreliable
  - should be corrected in the source systems before extracting
  - affects both refresh data and history data
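
The four kinds of violation can be illustrated with one small checker. The slides only name the categories, so the concrete rules below are assumptions chosen for illustration: an ISO date format, a customer lookup, a name comparison between two hypothetical systems, and a quantity-times-price check.

```python
import re

# Hypothetical reference data from two source systems.
customers = {"C1": "Acme", "C2": "Bolt"}              # sales system
billing_customers = {"C1": "Acme", "C2": "Bolt Inc"}  # accounting system

orders = [
    {"id": 1, "customer": "C1", "date": "2015-03-01", "qty": 2, "unit": 5.0, "total": 10.0},
    {"id": 2, "customer": "C9", "date": "03/02/15", "qty": 1, "unit": 4.0, "total": 4.0},
    {"id": 3, "customer": "C2", "date": "2015-03-03", "qty": 3, "unit": 2.0, "total": 7.0},
]


def find_violations(order):
    """Return the names of the dirty-data categories this order falls into."""
    problems = []
    # Format violation: the date is not in the assumed YYYY-MM-DD layout.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", order["date"]):
        problems.append("format")
    # Referential integrity violation: the order points at an unknown customer.
    if order["customer"] not in customers:
        problems.append("referential integrity")
    # Cross-system matching violation: the two systems disagree on the customer.
    elif customers[order["customer"]] != billing_customers.get(order["customer"]):
        problems.append("cross-system matching")
    # Internal consistency violation: the total does not equal qty * unit price.
    if abs(order["qty"] * order["unit"] - order["total"]) > 1e-9:
        problems.append("internal consistency")
    return problems


report = {o["id"]: find_violations(o) for o in orders}
print(report)
```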

Page 12

Data Store 2 - The Integration Layer

- a normalized database in a single place
  - normalization: break a flat file into smaller files to store the data more efficiently

Why Build an Integration Layer?
- Avoids extraction repetition
  - multiple data marts use data from the same source systems
  -> they read from only one source (already integrated, clean data)
- Ensures standard interpretation of enterprise data
  - multiple groups interpret the same data differently
  -> develop common definitions shared across the organization
- Provides a more flexible repository than the denormalized structures in the HPQS layer
  - denormalized data structures in HPQS are built for querying and are inflexible
  -> changing them is complex and requires reintegration and recleansing

Page 13

Data Store 2 - The Integration Layer

Page 14

Data Store 2 - The Integration Layer

Introduction to Database Normalization
- data model in third normal form
- completely denormalized data -> 1NF

Page 15

Data Store 2 - The Integration Layer

First Normal Form
- eliminate repeating groups!
-> 2NF

Page 16

Data Store 2 - The Integration Layer

Second Normal Form
- all non-key attributes of a table must rely on the entire key of the table
-> 3NF

Page 17

Data Store 2 - The Integration Layer

Third Normal Form
- all non-key fields must depend solely on the table's primary key
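
The three normalization steps can be traced on a tiny order file. The slides' own example tables did not survive in this transcript, so the data and every table and column name below are hypothetical; plain Python dictionaries stand in for database tables.

```python
# Completely denormalized: one row per order, with a repeating group of items.
flat = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Seoul",
     "items": [("P1", "Pen", 3), ("P2", "Ink", 1)]},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Seoul",
     "items": [("P1", "Pen", 5)]},
]

# 1NF: eliminate the repeating group; one row per (order_id, product_id).
first_nf = [
    {"order_id": o["order_id"], "customer_id": o["customer_id"],
     "customer_city": o["customer_city"],
     "product_id": pid, "product_name": pname, "qty": qty}
    for o in flat
    for (pid, pname, qty) in o["items"]
]

# 2NF: attributes that rely on only part of the key (order_id, product_id)
# move to tables keyed by that part.
order_lines = {(r["order_id"], r["product_id"]): r["qty"] for r in first_nf}
products = {r["product_id"]: r["product_name"] for r in first_nf}
orders_2nf = {r["order_id"]: (r["customer_id"], r["customer_city"]) for r in first_nf}

# 3NF: customer_city depends on customer_id, not on the order key,
# so it moves to its own customer table.
orders = {oid: cust for oid, (cust, _city) in orders_2nf.items()}
customers = {cust: city for (cust, city) in orders_2nf.values()}

print(order_lines, products, orders, customers, sep="\n")
```

After the last step the city "Seoul" is stored once instead of once per order line, which is the storage saving the integration layer is after.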

Page 18

Data Store 2 - The Integration Layer

What "Extra" Data Must the Integration Layer Hold?
- Surrogate keys
  - sequential numbers generated by warehouse load programs
  - have no business meaning
  - Benefits
    - a single surrogate key for the same item that has different keys in different source systems
    - easy tracking of information as it moves
- Dates, statuses, and other fields
  - auditing support; easy identification of the data to send to the data marts
  - additional information held in the warehouse
    - e.g. insert date, last update date, status flag, etc.

Another Note About Dirty Data
- Techniques for handling bad records
  - Ignoring them.
  - Rejecting bad records, but saving them in a separate file for manual review.
  - Loading the bad record and pointing out the errors for later review.
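
A minimal sketch of surrogate key assignment in a load program, with the audit fields attached to a loaded row. Assumptions: the class, its method names, the source-system names, and the keys are all hypothetical, and the decision that two source keys refer to the same customer is taken as given from an earlier matching step.

```python
from datetime import date
from itertools import count


class SurrogateKeyGenerator:
    """Hands out sequential keys with no business meaning."""

    def __init__(self):
        self._next = count(1)
        self._by_natural_key = {}

    def key_for(self, source_system, natural_key):
        """Return the surrogate key for a source key, creating one if needed."""
        natural = (source_system, natural_key)
        if natural not in self._by_natural_key:
            self._by_natural_key[natural] = next(self._next)
        return self._by_natural_key[natural]

    def alias(self, source_system, natural_key, existing_surrogate):
        """Map another system's key for the same item onto an existing surrogate."""
        self._by_natural_key[(source_system, natural_key)] = existing_surrogate


keys = SurrogateKeyGenerator()
acme = keys.key_for("sales", "CUST-0042")
keys.alias("accounting", "A-77", acme)  # same customer, different source key

# Loaded row with the "extra" audit fields the integration layer holds.
row = {
    "customer_sk": acme,
    "name": "Acme",
    "insert_date": date.today(),
    "last_update_date": date.today(),
    "status_flag": "A",
}
print(row["customer_sk"], keys.key_for("accounting", "A-77"))  # prints: 1 1
```

Data marts loaded later can reuse `customer_sk` as-is and select rows by `last_update_date`.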

Page 19

Data Store 2 - The Integration Layer

(figure; surviving label: key)

Page 20

Data Flow 2 - From the Integration Layer to the High Performance Query Structures

- data is extracted from the integration layer and inserted into the data marts
  - ETL: extract, transform, and load to populate data marts
  - benefits of loading from the integration layer
    - no cleansing and integration
    - identify the records to load using timestamps
    - no creating surrogate keys (only reuse!)
- use of summary tables
  - data marts differ from the data warehouse
  - they hold some summaries of their atomic-level detail
  -> load both the atomic-level data and the summary tables
  - Oracle8i
    - create a materialized view
    - automatic refresh on every commit
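
Loading both the atomic-level data and a summary table can be sketched with Python's `sqlite3` module. This is a manual rebuild, not an Oracle materialized view: SQLite has no materialized views, so the `refresh_summary` function below stands in for the automatic refresh. Table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (sale_date TEXT, product_id TEXT, amount REAL);
CREATE TABLE sales_by_product (product_id TEXT PRIMARY KEY, total_amount REAL);
""")

# Load the atomic-level detail.
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("2015-03-01", "P1", 10.0),
        ("2015-03-01", "P2", 4.0),
        ("2015-03-02", "P1", 6.0),
    ],
)


def refresh_summary(conn):
    """Rebuild the summary table from the atomic-level data."""
    conn.execute("DELETE FROM sales_by_product")
    conn.execute("""
        INSERT INTO sales_by_product
        SELECT product_id, SUM(amount) FROM sales_fact GROUP BY product_id
    """)


refresh_summary(conn)
summary = conn.execute(
    "SELECT product_id, total_amount FROM sales_by_product ORDER BY product_id"
).fetchall()
print(summary)  # [('P1', 16.0), ('P2', 4.0)]
```

Queries that only need totals read the two-row summary instead of scanning every sale.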

Page 21

Data Store 3 - High Performance Query Structures (HPQS)

- databases and data structures to support end-user queries
- databases managed by either relational database engines or multidimensional database engines
- a logical structure, not a physical structure
  - may share the same computer with the data warehouse
  - physically different table designs
- easier and faster for end users to access than normalized database formats

Page 22

Data Flow 3 - From the High Performance Query Structures to the End User Reporting Applications

- Query tools issue SQL calls to relational databases
- data is returned to the tools and formatted
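
What a query tool does in this flow can be sketched as one SQL call followed by formatting. Assumptions: the fact and dimension tables, their names, and the report layout are hypothetical, and `sqlite3` stands in for whatever relational engine manages the query structures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_sk INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales_fact (product_sk INTEGER, amount REAL);
INSERT INTO product_dim VALUES (1, 'Pen'), (2, 'Ink');
INSERT INTO sales_fact VALUES (1, 10.0), (1, 6.0), (2, 4.0);
""")


def run_report(conn):
    """Issue the SQL call, then format the returned rows for the end user."""
    rows = conn.execute("""
        SELECT d.product_name, SUM(f.amount)
        FROM sales_fact AS f
        JOIN product_dim AS d ON d.product_sk = f.product_sk
        GROUP BY d.product_name
        ORDER BY d.product_name
    """).fetchall()
    # Left-align the name, right-align the total with two decimals.
    return [f"{name:<10}{total:>8.2f}" for name, total in rows]


report_lines = run_report(conn)
print("\n".join(report_lines))
```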

Page 23

Data Store 4 - Data in the End User's Hands

- reports and analyses in the end user's hands
- the last data store in the warehousing architecture
- "How can I prevent a bad employee from selling warehouse data to one of our competitors?"
  - the only way is to deny him access to that data in the first place

Page 24

Alternate Warehousing Architectures

- Alternate Architecture 1 - No Warehouse
  - if there is no demand for a warehouse, don't build it
  - transaction systems are strong and end-user queries are limited
- Alternate Architecture 2 - Normalized Design
  - data is integrated in the integration layer
  - users query directly out of the integration layer
  - gives the integration benefits, but not usability or query performance
- Alternate Architecture 3 - Just Data Marts
  - build one or more data marts without a normalized integration layer
  - for cases with no need to integrate data from multiple systems