The Data Warehouse This is where the data lives

Upload: patience-haynes

Post on 30-Dec-2015

Page 1: The Data Warehouse This is where the data lives. 2 Agenda zWhat is a Warehouse? zData Warehouse Architecture zData Storage zData Transformation zDBMS

The Data Warehouse

This is where the data lives

Agenda

What is a Warehouse?
Data Warehouse Architecture
Data Storage
Data Transformation
DBMS in Data Warehousing
Building a Successful Data Warehouse

WHAT IS A DATA WAREHOUSE?

Peter Drucker believes ...

The two most important people in the 21st Century will be the CFO (managing the Cash Flow) and the CIO (managing the Information Flow)

Data, Data everywhere, yet ...

I can’t find the data I need
  data is scattered over the network
  many versions, subtle differences
I can’t get the data I need
  need an expert to get the data
I can’t understand the data I found
  available data poorly documented
I can’t use the data I found
  results are unexpected
  data needs to be transformed from one form to other

Why Data Warehousing?

Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?

What are the users saying...

Data should be integrated across the enterprise

Summary data had a real value to the organization

Historical data held the key to understanding data over time

What-if capabilities are required

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]

[Diagram: Data -> Information]

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.

[Barry Devlin]

Data Warehousing -- It is a process

Technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible

A decision support database maintained separately from the organization’s operational database

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

-- Bill Inmon, Building the Data Warehouse 1996

Traditional RDBMS used for OLTP

Database systems have been used traditionally for OLTP
  clerical data processing tasks
  detailed, up-to-date data
  structured, repetitive tasks
  read/update a few records
  isolation, recovery and integrity are critical
We will call these operational systems

Operational Systems

Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers, products -- clerks, salespeople etc.
They are increasingly used by customers

Examples of Operational Data

Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | Client/server, relational databases | Very large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large
Production Record | Manufacturing | Control production | New applications, relational databases, AS/400 | Medium

Application-Orientation vs. Subject-Orientation

Application-orientation (operational database): Loans, Credit Card, Trust, Savings
Subject-orientation (data warehouse): Customer, Vendor, Product, Activity

OLTP vs. Data Warehouse

OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse

Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)

e.g., average amount spent on phone calls between 9AM-5PM in California during the month of December
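The example query above can be sketched against a tiny in-memory table. This is only an illustration: the `calls` table, its column names, and the sample rows are assumptions and do not come from the slides. "9AM-5PM" is read here as calls starting in hours 09 through 16.

```python
import sqlite3

# Hypothetical call-detail table; names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (caller TEXT, state TEXT, started_at TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?)",
    [
        ("a", "CA", "1995-12-04 10:15:00", 2.00),  # matches all three filters
        ("b", "CA", "1995-12-11 16:59:00", 4.00),  # matches all three filters
        ("c", "CA", "1995-12-11 20:30:00", 9.00),  # outside 9AM-5PM
        ("d", "NY", "1995-12-12 11:00:00", 7.00),  # wrong state
        ("e", "CA", "1995-11-30 11:00:00", 5.00),  # wrong month
    ],
)

# Three dimensions restrict the rows: geography, month, and time of day.
avg_amount = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE state = 'CA'
      AND strftime('%m', started_at) = '12'
      AND strftime('%H', started_at) BETWEEN '09' AND '16'
    """
).fetchone()[0]
print(avg_amount)
```

A query like this scans and aggregates many rows along several dimensions at once, which is the access pattern an OLTP system is not tuned for.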

OLTP vs. Data Warehouse

Complex Data Warehouse queries would degrade performance of operational DBMS

Data Warehouse requires historical data; not typically maintained by operational databases

Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBMS, external sources, legacy systems

Different sources typically use different representations, code and format which have to be reconciled

OLTP vs Data Warehouse

OLTP | Warehouse (DSS)
Application oriented | Subject oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up to date | Snapshot data
Isolated data | Integrated data
Repetitive access | Ad-hoc access
Clerical user | Knowledge user (manager)

OLTP vs Data Warehouse

OLTP | Data Warehouse
Performance sensitive | Performance relaxed
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100 MB - 100 GB | Database size 100 GB - few terabytes

OLTP vs Data Warehouse

OLTP | Data Warehouse
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets

Criteria for Selecting a Warehouse

load/index time
query response time
database size requirements/limitations
quality
ratio of raw data size to full database size (including indices, temp space, etc.)
parallel capabilities
price
company DBMS standardization policy

DATA WAREHOUSE ARCHITECTURE

Data Warehouse Architecture - Process View

[Diagram components: Relational Databases, Legacy Data, Purchased Data, Extraction, Cleansing, Optimized Loader, Data Warehouse Engine, Analyze, Query, Metadata Repository]

Data Warehouse Architecture - User View

[Diagram: three stages labeled Getting Data In, Heart of the Data Warehouse, and Getting Information Out. Components shown: IT Users, Operational Data Stores, Data Transformation, Enterprise Warehouse Management, Replication & Propagation, Data Marts, Departmental Warehouses, Knowledge Discovery/Data Mining, Information Access Tools, Business Users]

Heart of the Data Warehouse

Heart of the data warehouse is the data itself!
Single version of the truth
Corporate memory
Data is organized in a way that represents business -- subject orientation

Data Warehouse Structure

Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables

Data Warehouse Structure

base customer (1985-87)
  custid, from date, to date, name, phone, dob
base customer (1988-90)
  custid, from date, to date, name, credit rating, employer
customer activity (1986-89) -- monthly summary
customer activity detail (1987-89)
  custid, activity date, amount, clerk id, order no
customer activity detail (1990-91)
  custid, activity date, amount, line item no, order no

Time is part of key of each table

Data Warehouse Structure

Base Customer (1985-87)
Base Customer (1988-90)
Cust Activity (1986-89)
Cust Activity Detail (1987-89)
Cust Activity Detail (1990-91)

Subject data may contain different data on different media
Can also use optical disks

Schema Design

Database organization
  must look like business
  must be recognizable by business user
  approachable by business user
  must be simple
Schema types
  Star schema
  Fact constellation schema
  Snowflake schema

Dimension Tables

Dimension tables
  Define business in terms already familiar to users
  Wide rows with lots of descriptive text
  Small tables (about a million rows)
  Joined to fact table by a foreign key
  heavily indexed
  typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Fact Table

Central table
  mostly raw numeric items
  narrow rows, a few columns at most
  large number of rows (millions to a billion)
  Access via dimensions

Star Schema

A single fact table and for each dimension one dimension table
Does not capture hierarchies directly

[Diagram: fact table (date, custno, prodno, cityname, ...) surrounded by dimension tables Time, prod, cust, city]
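A star schema of this shape can be sketched with Python's built-in `sqlite3` module. The key columns `custno`, `prodno` and `cityname` follow the slide; every other column, the `amount` measure, and the sample rows are assumptions for illustration, and the time dimension table is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one per dimension, holding descriptive attributes.
    CREATE TABLE cust (custno INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prod (prodno INTEGER PRIMARY KEY, prodname TEXT);
    CREATE TABLE city (cityname TEXT PRIMARY KEY, state TEXT);
    -- Fact table: narrow rows, a key into each dimension plus a measure.
    CREATE TABLE fact (
        date TEXT,
        custno INTEGER REFERENCES cust(custno),
        prodno INTEGER REFERENCES prod(prodno),
        cityname TEXT REFERENCES city(cityname),
        amount REAL
    );
    INSERT INTO cust VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO prod VALUES (10, 'widget'), (20, 'gadget');
    INSERT INTO city VALUES ('Pune', 'MH'), ('Agra', 'UP');
    INSERT INTO fact VALUES
        ('1996-01-05', 1, 10, 'Pune', 100.0),
        ('1996-01-06', 2, 10, 'Agra', 50.0),
        ('1996-01-07', 1, 20, 'Pune', 25.0);
    """
)

# Access the facts via a dimension: total amount per state.
sales_by_state = dict(
    conn.execute(
        """
        SELECT city.state, SUM(fact.amount)
        FROM fact JOIN city ON fact.cityname = city.cityname
        GROUP BY city.state
        """
    ).fetchall()
)
print(sales_by_state)
```

Note that `state` lives inside the `city` dimension here, so the city-to-state hierarchy is implicit in one table rather than modeled directly.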

Snowflake schema

Represent dimensional hierarchy directly by normalizing tables.
Easy to maintain and saves storage

[Diagram: fact table (date, custno, prodno, cityname, ...) with dimension tables Time, prod, cust, city, plus a region table]

Fact Constellation

Fact Constellation
  Multiple fact tables that share many dimension tables
  Booking and Checkout may share many dimension tables in the hotel industry

[Diagram: fact tables Booking and Checkout sharing dimension tables Hotels, Travel Agents, Promotion, Room Type, Customer]

DATA STORAGE IN DATA WAREHOUSE

Data Granularity in Warehouse

Summarized data stored
  reduce storage costs
  reduce cpu usage
  increases performance since smaller number of records to be processed
  design around traditional high level reporting needs
  tradeoff with volume of data to be stored and detailed usage of data

Granularity in Warehouse

Cannot answer some questions with summarized data
  Did Ashish call Vivek last month? Not possible to answer if only the total duration of calls by Ashish over a month is maintained and individual call details are not.
Detailed data too voluminous
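The loss of information at the summary level can be shown with a few made-up records. The record layout and the helper `called` are hypothetical; the point is only that the callee disappears once calls are rolled up to monthly totals.

```python
from collections import defaultdict

# Hypothetical call detail records: (caller, callee, month, minutes).
call_details = [
    ("Ashish", "Vivek", "1996-03", 5),
    ("Ashish", "Meera", "1996-03", 12),
    ("Vivek", "Ashish", "1996-03", 3),
]

# Summary level: total call minutes per caller per month.
monthly_totals = defaultdict(int)
for caller, _callee, month, minutes in call_details:
    monthly_totals[(caller, month)] += minutes

def called(details, caller, callee, month):
    """Answerable only from detail rows, which still carry the callee."""
    return any(r[:3] == (caller, callee, month) for r in details)

print(monthly_totals[("Ashish", "1996-03")])
print(called(call_details, "Ashish", "Vivek", "1996-03"))
```

The summary needs one row per caller per month however many calls were made, which is where the storage saving comes from.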

Granularity in Warehouse

Tradeoff is to have dual level of granularity
  Store summary data on disks
    95% of DSS processing done against this data
  Store detail on tapes
    5% of DSS processing against this data

Estimates of Data Volume

To determine appropriate level (dual or single) of granularity we need to estimate the disk space requirements
For each known table
  Get upper and lower bounds on size of row
  Estimate max and min number of rows in table for the 1 year horizon and the 5 year horizon
  Calculate space for indexes for max and min number of rows
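The estimation steps above can be sketched as a small calculation. The function name, the flat per-row index overhead, and all the figures are assumptions chosen for illustration; a real estimate would use the index sizes of the actual DBMS.

```python
def estimate_table_bytes(row_bytes, rows, index_bytes_per_row):
    """Return (min, max) size in bytes for one table.

    row_bytes and rows are (low, high) pairs.
    index_bytes_per_row is an assumed flat per-row index overhead.
    """
    low = rows[0] * (row_bytes[0] + index_bytes_per_row)
    high = rows[1] * (row_bytes[1] + index_bytes_per_row)
    return low, high

# Hypothetical figures for one table at the two planning horizons.
one_year = estimate_table_bytes((100, 200), (50_000, 200_000), 20)
five_year = estimate_table_bytes((100, 200), (500_000, 5_000_000), 20)
print(one_year, five_year)
```

Summing these bounds over every known table gives the range of disk space to plan for at each horizon.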

Dual level of granularity

# Rows (1 Year Horizon) | # Rows (5 Year Horizon) | Recommendation
10,000 | 100,000 | Any design will do
100,000 | 1,000,000 | Careful design
1,000,000 | 10,000,000 | Dual levels of granularity
10,000,000 | 20,000,000 | Dual levels of granularity and careful design

Dual Level of Granularity

On the five year horizon, the totals shift by an order of magnitude. Reason is:
  More expertise will be available in managing the data warehouse volumes of data
  Hardware costs will have dropped
  More powerful software tools will be available
  End user will be more sophisticated
Actual size of record is not that important since size of indexes determines the above

What should be granularity level?

The starting point for deciding the level of granularity is based on the previous estimates and an educated guess
The initial guess is refined through an iterative analysis

[Diagram: iterative loop of Design, Populate, Reports and Analysis involving the Data Warehouse Developer and DSS Analysts]

How to control granularity?

Summarize data from source as it goes into target
Average data as it goes into target
Push highest/lowest set values into target
Push only data that is needed at target
Push only a subset of rows based on some conditions

Levels of Granularity

Banking Example

[Diagram: record layouts at different levels of granularity
  Operational, 60 days of activity: account, activity date, amount, teller, location, account bal
  Monthly account register, up to 10 years: account, month, # trans, withdrawals, deposits, average bal
  amount, activity date, amount, account bal
  Note: not all fields need be archived]
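Rolling operational activity up into a monthly register might look like the sketch below. The field names loosely follow the banking example, but the sign convention for amounts, the sample rows, and the way the average balance is computed are all assumptions.

```python
from collections import defaultdict

# Hypothetical operational rows: (account, activity_date, amount, account_bal).
# Negative amounts are withdrawals, positive amounts are deposits.
activity = [
    ("A1", "1996-01-03", -40.0, 960.0),
    ("A1", "1996-01-17", 100.0, 1060.0),
    ("A1", "1996-02-02", -60.0, 1000.0),
]

def monthly_register(rows):
    """Roll detail rows up to one summary record per (account, month)."""
    groups = defaultdict(list)
    for account, date, amount, bal in rows:
        groups[(account, date[:7])].append((amount, bal))
    summary = {}
    for key, items in groups.items():
        summary[key] = {
            "n_trans": len(items),
            "withdrawals": -sum(a for a, _ in items if a < 0),
            "deposits": sum(a for a, _ in items if a > 0),
            # Simple mean of post-transaction balances, not a daily average.
            "average_bal": sum(b for _, b in items) / len(items),
        }
    return summary

register = monthly_register(activity)
print(register[("A1", "1996-01")])
```

Fields such as teller and location are dropped on the way, which is one way of pushing only the data that is needed into the target.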

Partitioning

Breaking data into several physical units that can be handled separately

Not a question of whether to do it in data warehouses but how to do it

Granularity and partitioning are key to effective implementation of a warehouse

Why Partitioning?

Flexibility in managing data
Smaller physical units allow
  easy restructuring
  free indexing
  sequential scans if needed
  easy reorganization
  easy recovery
  easy monitoring

Criterion for Partitioning

Typically partitioned by
  date
  line of business
  geography
  organizational unit
  any combination of above

Partitioning Example

An insurance company may partition its data as follows:
  1995 medical claims
  1995 life claims
  1996 medical claims
  1996 life claims
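The insurance example combines two criteria, date and line of business. A minimal sketch of routing records to such partitions follows; the record layout and the `claims_<year>_<line>` naming scheme are hypothetical.

```python
from collections import defaultdict

# Hypothetical claim records: (claim_id, claim_date, line_of_business).
claims = [
    (1, "1995-03-14", "medical"),
    (2, "1995-11-02", "life"),
    (3, "1996-01-20", "medical"),
    (4, "1996-07-09", "medical"),
]

def partition_name(claim_date, line_of_business):
    """Partition key: year plus line of business."""
    return f"claims_{claim_date[:4]}_{line_of_business}"

partitions = defaultdict(list)
for claim in claims:
    partitions[partition_name(claim[1], claim[2])].append(claim)

print(sorted(partitions))
```

Each resulting unit can then be indexed, reorganized, or recovered on its own.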

Where to Partition?

Application level or DBMS level
Makes sense to partition at application level
  Allows different definition for each year
    Important since warehouse spans many years and as business evolves definition changes
  Allows data to be moved between processing complexes easily

Denormalization

Normalization in a data warehouse may lead to lots of small tables

Can lead to excessive I/O’s since many tables have to be accessed

Denormalization is the answer especially since updates are rare

Denormalization

Create Arrays
Selective Redundancy
Derived Data

Creating Arrays

Many times each occurrence of a sequence of data is in a different physical location
Beneficial to collect all occurrences together and store as an array in a single row
Makes sense only if there are a stable number of occurrences which are accessed together
In a data warehouse, such situations arise naturally due to time based orientation
  can create an array by month
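An array by month can be sketched as follows. The input layout (one row per account per month) and the choice of 0.0 for months without activity are assumptions for illustration.

```python
# Hypothetical normalized rows: one row per (account, month) occurrence.
monthly_rows = [
    ("A1", 1, 120.0),
    ("A1", 2, 80.0),
    ("A1", 12, 45.0),
    ("B7", 3, 10.0),
]

def to_month_arrays(rows):
    """Collect the 12 monthly occurrences of each account into a single row."""
    arrays = {}
    for account, month, amount in rows:
        # Months with no activity stay at 0.0; the slot count is fixed at 12.
        arrays.setdefault(account, [0.0] * 12)[month - 1] = amount
    return arrays

by_account = to_month_arrays(monthly_rows)
print(by_account["A1"])
```

The fixed count of twelve slots is what makes the month a good fit: the number of occurrences is stable and a year is usually read together.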

Selective Redundancy

Description of an item can be stored redundantly with order table -- most often item description is also accessed with order table

Updates have to be careful

Vertical Partitioning

Original table: acctno, balance, address, date opened, ...
Split into:
  acctno, balance -- frequently accessed; smaller table and so less I/O
  acctno, address, date opened, ... -- rarely accessed
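The split above can be sketched with two `sqlite3` tables that share the account number. The table names `account_hot` and `account_cold` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Frequently accessed columns: small rows, so fewer pages per scan.
    CREATE TABLE account_hot (acctno INTEGER PRIMARY KEY, balance REAL);
    -- Rarely accessed columns, keyed by the same account number.
    CREATE TABLE account_cold (
        acctno INTEGER PRIMARY KEY REFERENCES account_hot(acctno),
        address TEXT,
        date_opened TEXT
    );
    INSERT INTO account_hot VALUES (101, 2500.0), (102, 90.0);
    INSERT INTO account_cold VALUES
        (101, '12 MG Road', '1991-04-01'),
        (102, '7 Park Street', '1994-09-15');
    """
)

# The common query touches only the narrow table.
total_balance = conn.execute("SELECT SUM(balance) FROM account_hot").fetchone()[0]

# The full row is still recoverable with a join on acctno when needed.
full_row = conn.execute(
    """
    SELECT h.acctno, h.balance, c.address, c.date_opened
    FROM account_hot h JOIN account_cold c ON h.acctno = c.acctno
    WHERE h.acctno = 101
    """
).fetchone()
print(total_balance, full_row)
```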

Derived Data

Introduction of derived (calculated) data may often help

Have seen this in the context of dual levels of granularity

Can keep auxiliary views and indexes to speed up query processing

DATA TRANSFORMATION

Loading the Warehouse

Load is a crucial component for the success of warehouse project
Issues:
  Sources of data for the warehouse
  Data quality at the sources
  Data Transformation
  How to propagate updates (on the sources) to the warehouse
  Terabytes of data to be loaded

Source Data

Typically host based, legacy applications
  Customized applications, COBOL, 3GL, 4GL
Point of Contact Devices
  POS, ATM, Call switches
External Sources
  Vendors, Delivery Partners, Agents

[Diagram: Operational/Source Data -- Sequential, Legacy, Relational, External]

Data Quality - The Reality

Tempting to think that all there is to creating a data warehouse is extracting operational data and entering it into a data warehouse

Nothing could be farther from the truth

Warehouse data comes from disparate, questionable sources

Data Quality - The Reality

Legacy systems no longer documented
Outside sources with questionable quality procedures
Production systems with no built in integrity checks and no integration
  Operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan
    “… And get it done quickly, we do not have time to worry about corporate standards...”

Data Transformation

Data transformation is the foundation for achieving single version of the truth
Major concern for IT
Data warehouse can fail if appropriate data transformation strategy is not developed

[Diagram: Operational/Source Data (Sequential, Legacy, Relational, External) feeding Data Transformation (Accessing, Capturing, Extracting, Householding, Filtering, Reconciling, Conditioning, Loading, Validating, Scoring)]

Data Integration Across Sources

[Diagram: four source systems (Trust, Credit card, Savings, Loans) illustrating:
  Same data, different name
  Different data, same name
  Data found here, nowhere else
  Different keys, same data]

Data Transformation Example

field
  appl A - balance
  appl B - bal
  appl C - currbal
  appl D - balcurr
unit
  appl A - pipeline - cm
  appl B - pipeline - in
  appl C - pipeline - feet
  appl D - pipeline - yds
encoding
  appl A - m,f
  appl B - 1,0
  appl C - x,y
  appl D - male, female

All four applications feed the Data Warehouse
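The three kinds of reconciliation in this example (field name, unit, encoding) can be sketched as lookup tables. The target conventions chosen here (field `balance`, centimetres, `m`/`f`) and the guess that `1` and `x` stand for male are assumptions; the slide does not say which convention the warehouse adopts.

```python
# Hypothetical per-application rules; the target conventions (field name
# "balance", unit cm, encoding "m"/"f") are assumptions for illustration.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, feet, yds
GENDER = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},
    "C": {"x": "m", "y": "f"},
    "D": {"male": "m", "female": "f"},
}

def to_warehouse(app, record):
    """Map one source record onto the warehouse's single representation."""
    return {
        "balance": record[BALANCE_FIELD[app]],
        "pipeline_cm": round(record["pipeline"] * TO_CM[app], 2),
        "gender": GENDER[app][str(record["gender"])],
    }

row_b = to_warehouse("B", {"bal": 500.0, "pipeline": 10, "gender": 1})
row_d = to_warehouse("D", {"balcurr": 75.0, "pipeline": 2, "gender": "female"})
print(row_b, row_d)
```

Keeping the rules in data rather than in code makes it easier to add a fifth application later.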

Data Integrity Problems

Same person, different spellings
  Rajiv, Rajeev, Raju...
Multiple ways to denote company name
  Novatek Systems, NISL, Novatek Pvt. Ltd.
Use of different names
  Mumbai, Bombay
  Bill, William
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
  manual entry leads to mistakes
  “in case of a problem use 9999999”

Data Transformation Terms

Extracting
Conditioning
Scrubbing
Merging
Householding
Enrichment
Scoring
Loading
Validating
Delta Updating

Data Transformation Terms

Extracting
  Capture of data from operational source in “as is” status
  Sources for data generally in legacy mainframes in VSAM, IMS, IDMS, DB2; more data today in relational databases on Unix
Conditioning
  The conversion of data types from the source to the target data store (warehouse) -- always a relational database

Data Transformation Terms

Scrubbing
  Ensuring all data meets the input validation rules which should have been in place when the data was captured by the operational system. E.g., null values for data declared not null, numeric in non-numeric, proper zip codes etc.
Merging
  Bringing together data from operational sources. Choosing information from each functional system to populate the single occurrence of the data item in the warehouse
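Scrubbing can be sketched as a function that reports which validation rules a record breaks. The three rules below (required name, numeric amount, 5-digit zip code) and the field names are hypothetical examples, not rules taken from the slides.

```python
import re

def scrub(record):
    """Return a list of validation problems found in one source record.

    The rules below (required name, numeric amount, 5-digit zip code) are
    hypothetical examples of checks the operational system should have made.
    """
    problems = []
    if not record.get("name"):
        problems.append("name is required")
    try:
        float(record.get("amount", ""))
    except (TypeError, ValueError):
        problems.append("amount is not numeric")
    if not re.fullmatch(r"\d{5}", str(record.get("zip", ""))):
        problems.append("zip code is malformed")
    return problems

clean = scrub({"name": "Rajiv", "amount": "12.50", "zip": "94105"})
dirty = scrub({"name": "", "amount": "N/A", "zip": "9410"})
print(clean, dirty)
```

Returning the problems instead of silently dropping the record leaves the decision (reject, repair, or flag) to a later step.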

Data Transformation Terms

Householding
  Identifying all members of a household (living at the same address)
  Ensures only one mail is sent to a household
  Can result in substantial savings: 1 million catalogues at Rs. 50 each costs Rs. 50 million. A 2% savings would save Rs. 1 million
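A very rough sketch of householding groups customers by a normalized address string. The normalization here (lower-casing, dropping commas and extra spaces) is a deliberately crude assumption; real address matching needs far more care. The last lines repeat the savings arithmetic from the slide.

```python
from collections import defaultdict

def households(customers):
    """Group customer names by a crudely normalized address string."""
    groups = defaultdict(list)
    for name, address in customers:
        key = " ".join(address.lower().replace(",", " ").split())
        groups[key].append(name)
    return dict(groups)

# Hypothetical mailing list.
mailing_list = [
    ("Rajiv", "12 MG Road, Pune"),
    ("Sunita", "12 mg road pune"),
    ("Bill", "7 Park Street, Kolkata"),
]
grouped = households(mailing_list)
catalogues_needed = len(grouped)  # one mailing per household

# The slide's arithmetic: 1 million catalogues at Rs. 50 each, 2% saved.
total_cost = 1_000_000 * 50
savings = total_cost * 2 // 100
print(catalogues_needed, total_cost, savings)
```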

Data Transformation Terms

Enrichment
  Bring data from external sources to augment/enrich operational data (e.g. foreign currency fluctuations over period). Data sources include newspaper groups, consulting firms etc.
Scoring
  computation of a probability of an event, e.g., chance that a customer will defect to AT&T from MCI, chance that a customer is likely to buy a new product

Data Transformation Terms

Loading
  placing data into the warehouse -- accomplished using a load utility provided by database vendors
Validating
  Process of ensuring that the data captured is accurate and transformation process is correct

Data Transformation Terms

Delta Updating
  propagation of changes to source since last extraction
  load smaller subsets into data warehouse
Metadata
  Data dictionary for the warehouse

Loads

After extracting, scrubbing, cleaning, validating etc. need to load the data into the warehouse
Issues
  huge volumes of data to be loaded
  small time window available when warehouse can be taken off line (usually nights)
  when to build index and summary tables
  allow system administrators to monitor, cancel, resume, change load rates
  Recover gracefully -- restart after failure from where you were and without loss of data integrity

Load Techniques

Use SQL to append or insert new data
  record at a time interface
  will lead to random disk I/O’s
Use batch load utility

Batch Load Utility

Sort input records on clustering key
Sequential I/O significantly faster than random I/O
Single pass load
  perform all transformations, scrub, clean, validate, aggregate etc.
  build indexes at same time and create summary tables
Sequential loads may take long (loading a TB warehouse may take ~100 days)
Exploit I/O Parallelism to load data at acceptable rates
Leverage knowledge of data warehouse schemas
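As a rough check on the "~100 days" figure, the sketch below works out the load rate it implies and what an idealized parallel load would take. It assumes 1 TB = 10**12 bytes, a load running around the clock, and throughput that scales linearly with the number of streams, which real systems only approximate.

```python
# Back-of-the-envelope check of the "~100 days per TB" figure.
# Assumes 1 TB = 10**12 bytes and a load running around the clock.
TERABYTE = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

implied_rate = TERABYTE / (100 * SECONDS_PER_DAY)  # bytes per second

def days_to_load(total_bytes, bytes_per_second, parallel_streams=1):
    """Idealized load time, assuming throughput scales linearly with streams."""
    return total_bytes / (bytes_per_second * parallel_streams) / SECONDS_PER_DAY

print(round(implied_rate))
print(days_to_load(TERABYTE, implied_rate, parallel_streams=50))
```

The implied sequential rate is on the order of a hundred kilobytes per second, which is why parallel I/O matters at this scale.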

Load Taxonomy

Incremental versus Full loads
Online versus Offline loads

Page 76: Incremental Load

- A full load is too disruptive, and is not required if updates since the last load can be identified easily
- Can use incremental load to reduce the data actually loaded
  - insert only updated tuples
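
A minimal sketch of an incremental load, assuming each source row carries an `updated_at` value that lets changes since the last load be identified. The table and column names (`source_orders`, `wh_orders`, `updated_at`) are hypothetical, and both tables live in one in-memory SQLite database only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER);
    CREATE TABLE wh_orders     (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER);
    INSERT INTO source_orders VALUES (1, 10.0, 100), (2, 20.0, 100), (3, 30.0, 100);
""")

def incremental_load(conn, last_load_time):
    """Copy only rows changed after last_load_time; return how many were loaded."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM source_orders WHERE updated_at > ?",
        (last_load_time,),
    ).fetchall()
    # INSERT OR REPLACE overwrites the warehouse copy of any row that changed.
    conn.executemany(
        "INSERT OR REPLACE INTO wh_orders (id, amount, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)

first = incremental_load(conn, 0)        # initial load picks up every row

# The source changes: one row is updated and one is added at time 200.
conn.execute("UPDATE source_orders SET amount = 25.0, updated_at = 200 WHERE id = 2")
conn.execute("INSERT INTO source_orders VALUES (4, 40.0, 200)")

second = incremental_load(conn, 100)     # only the two changed rows are loaded
```

This timestamp approach does not capture deletions at the source; those would need a separate mechanism such as a change log.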

Page 77: Online Load

- Online full loads
  - can load a new table while queries on the old table continue, if there is enough disk space
- Online incremental loads
  - conflict with queries
  - break into shorter transactions (every 1000 records or every so many seconds)
  - coordinate this sequence of transactions: must ensure consistency between base and derived tables and indices
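
The idea of breaking an online incremental load into short transactions can be sketched as follows. This is an illustration under assumptions, not a production loader: the `calls` base table and the `calls_per_city` derived table are hypothetical, and the derived table is updated inside the same transaction as the base table so the two stay consistent after every commit.

```python
import sqlite3
from collections import Counter
from itertools import islice

def batches(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def online_incremental_load(conn, rows, batch_size=1000):
    """Load rows in short transactions; return the number of commits."""
    commits = 0
    for chunk in batches(rows, batch_size):
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany("INSERT INTO calls (id, city) VALUES (?, ?)", chunk)
            # Maintain the derived table in the same transaction as the base table.
            for city, n in Counter(city for _, city in chunk).items():
                conn.execute(
                    "INSERT OR IGNORE INTO calls_per_city (city, n) VALUES (?, 0)",
                    (city,),
                )
                conn.execute(
                    "UPDATE calls_per_city SET n = n + ? WHERE city = ?", (n, city)
                )
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calls (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE calls_per_city (city TEXT PRIMARY KEY, n INTEGER);
""")
new_rows = [(i, "Pune" if i % 2 else "Delhi") for i in range(2500)]
commits = online_incremental_load(conn, new_rows)   # 2500 rows in batches of 1000
```

Between commits, queries see a state in which base and derived tables agree, at the cost of seeing a partially loaded batch sequence.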

Page 78: Refresh

- Propagate updates on source data to the warehouse
- Issues:
  - when to refresh
  - how to refresh -- refresh techniques

Page 79: When to Refresh?

- periodically (e.g., every night, every week) or after significant events
- on every update: not warranted unless the warehouse requires current data (e.g., up-to-the-minute stock quotes)
- refresh policy set by administrator based on user needs and traffic
- possibly different policies for different sources

Page 80: Refresh Techniques

- Full extract from base tables
  - read entire source table: too expensive
  - may be the only choice for legacy systems

Page 81: Refresh Techniques

- Incremental techniques
  - detect changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator)
    - snapshots (Oracle)
    - transaction shipping (Sybase)
  - compute changes to derived and summary tables
  - maintain transactional correctness for incremental load

Page 82: How To Detect Changes

- Create a snapshot log table to record ids of updated rows of source data and timestamp
- Detect changes by:
  - defining after-row triggers to update the snapshot log when the source table changes
  - using the regular transaction log to detect changes to source data
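
The trigger-based approach can be sketched with SQLite, which supports after-row triggers. The `customer` and `snapshot_log` tables and the one-letter operation codes are hypothetical names chosen for this example; a real source system would use its own DBMS's trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

    -- Snapshot log: id of the changed row, kind of change, and a timestamp.
    CREATE TABLE snapshot_log (
        row_id     INTEGER,
        op         TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER customer_after_insert AFTER INSERT ON customer FOR EACH ROW
    BEGIN
        INSERT INTO snapshot_log (row_id, op) VALUES (NEW.id, 'I');
    END;

    CREATE TRIGGER customer_after_update AFTER UPDATE ON customer FOR EACH ROW
    BEGIN
        INSERT INTO snapshot_log (row_id, op) VALUES (NEW.id, 'U');
    END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ann')")
conn.execute("INSERT INTO customer VALUES (2, 'Bob')")
conn.execute("UPDATE customer SET name = 'Robert' WHERE id = 2")

# The refresh process reads the log instead of scanning the whole source table.
changed = conn.execute(
    "SELECT row_id, op FROM snapshot_log ORDER BY rowid"
).fetchall()
```

Triggers add work to every source write, which is one reason the slide also lists reading the regular transaction log as an alternative.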

Page 83: DBMS for Data Warehousing

Page 84: Relational DBMS

- Features that support DSS
  - specialized indexing techniques
  - specialized join and scan methods
  - data partitioning and use of parallelism
  - complex query processing
  - intelligent aggregate processing
  - extensions to SQL and their processing

Page 85: Indexing Techniques

- Bitmap index:
  - a collection of bitmaps -- one for each distinct value of the column
  - each bitmap has N bits, where N is the number of rows in the table
  - the bit corresponding to a value v for a row r is set if and only if r has the value v for the indexed attribute

Page 86: Bitmap Index

Query: select * from customer where gender = 'F' and vote = 'Y'

Customer table and bitmaps (bit values follow from the gender and vote columns shown on the slide):

row  gender  vote  gender='F'  vote='Y'  AND
1    M       Y     0           1         0
2    F       Y     1           1         1
3    F       Y     1           1         1
4    F       N     1           0         0
5    F       N     1           0         0
6    M       N     0           0         0
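
A small sketch of how such a query could be answered with bitmaps, using the gender and vote values from the customer example. Python integers stand in for the bit vectors (bit i corresponds to row i, counting from 0); the function name `build_bitmap_index` is hypothetical.

```python
def build_bitmap_index(values):
    """Return {distinct value: bitmap}; bit i is set iff values[i] == value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

# Column values for the six customer rows in the example.
gender = ["M", "F", "F", "F", "F", "M"]
vote = ["Y", "Y", "Y", "N", "N", "N"]

gender_idx = build_bitmap_index(gender)
vote_idx = build_bitmap_index(vote)

# WHERE gender = 'F' AND vote = 'Y' becomes a single bitwise AND.
hits = gender_idx["F"] & vote_idx["Y"]
matching_rows = [r for r in range(len(gender)) if (hits >> r) & 1]
print(matching_rows)
```

Only the rows whose bit survives the AND need to be fetched from the table.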

Page 87: Bitmap Indexing

- Bit arithmetic (AND/OR) is fast
- Space occupied depends on the cardinality of domains
  - not good if the indexed attribute has too many distinct values
- Can be compressed effectively (e.g., using run-length encoding)
- Products that support bitmaps: Model 204, TargetIndex (Redbrick), IQ (Sybase), Oracle 7.3
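
To illustrate why sparse bitmaps compress well, here is a minimal run-length encoding sketch. It represents a bitmap as a string of '0' and '1' characters for readability; real systems use far more compact word-aligned schemes, so this should be read as an illustration of the idea only.

```python
from itertools import groupby

def rle_encode(bits):
    """Collapse a bit string into (bit, run_length) pairs."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]

def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original bit string."""
    return "".join(bit * length for bit, length in runs)

# A mostly-zero bitmap, as expected when the indexed value is rare.
bitmap = "0" * 8 + "1" * 4 + "0" * 15 + "1"
runs = rle_encode(bitmap)
print(runs)
```

The 28-bit bitmap collapses to four runs; the longer the runs of identical bits, the better the compression.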

Page 88: Join Indexes

- Pre-computed joins
- A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
  - e.g., a join index on the city dimension of the calls fact table
  - correlates, for each city, the calls (in the calls table) that originated from that city
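
A join index of this kind can be pictured as a map from each dimension key to the positions of the matching fact rows. The sketch below uses hypothetical data and names (`cities`, `calls`, `build_join_index`) and keeps everything in memory.

```python
from collections import defaultdict

# Hypothetical dimension and fact data.
cities = {10: "Mumbai", 20: "Pune"}             # city_id -> city name
calls = [                                        # (call_id, city_id, minutes)
    (1, 10, 5), (2, 20, 3), (3, 10, 7), (4, 20, 1), (5, 10, 2),
]

def build_join_index(fact_rows, key_pos):
    """Map each dimension key to the positions of fact rows that carry it."""
    index = defaultdict(list)
    for pos, row in enumerate(fact_rows):
        index[row[key_pos]].append(pos)
    return dict(index)

city_join_index = build_join_index(calls, key_pos=1)

def calls_from(city_id):
    """Fetch the calls for one city through the index, without scanning calls."""
    return [calls[pos] for pos in city_join_index.get(city_id, [])]

mumbai_minutes = sum(minutes for _, _, minutes in calls_from(10))
```

The join is paid for once, when the index is built, rather than on every query.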

Page 89: Join Indexes

- Join indexes can also span multiple dimension tables
  - e.g., a join index on the city and time dimensions of the calls fact table

Page 90: Star Join Processing

- Use join indexes to join dimension and fact tables

[Figure: Calls is joined with the Time, Location, and Plan dimensions in turn, giving C+T, then C+T+L, then C+T+L+P]

Page 91: Optimized Star Join Processing

[Figure: selections are applied to Time, Location, and Plan; a virtual cross product of T, L, and P is joined with Calls]

Page 92: Bitmapped Join Processing

[Figure: Time, Location, and Plan are each matched against Calls, yielding the bitmaps 101, 001, and 110, which are combined with AND]

Page 93: Intelligent Scan

- Piggyback multiple scans of a relation (Redbrick)
  - piggybacking is also done if the second scan starts a little while after the first scan

Page 94: Parallel Query Processing

- Three forms of parallelism
  - independent
  - pipelined
  - partitioned and “partition and replicate”
- Deterrents to parallelism
  - startup
  - communication

Page 95: Parallel Query Processing

- Partitioned data
  - parallel scans
  - yields I/O parallelism
- Parallel algorithms for relational operators
  - joins, aggregates, sort
- Parallel utilities
  - load, archive, update, parse, checkpoint, recovery
- Parallel query optimization
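
Partitioned parallelism for an aggregate can be sketched as: split the rows into partitions, aggregate each partition independently, then merge the partial results. The example below uses threads from the standard library purely to show the structure; in CPython it is not expected to speed up CPU-bound work. The data and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def local_aggregate(partition):
    """Sum minutes per city within one partition."""
    totals = Counter()
    for city, minutes in partition:
        totals[city] += minutes
    return totals

def parallel_sum_by_city(rows, workers=3):
    """Round-robin partition the rows, aggregate in parallel, then merge."""
    partitions = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(local_aggregate, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)   # Counter.update adds counts together
    return dict(merged)

rows = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5)]
totals = parallel_sum_by_city(rows)
```

Sums merge cleanly because addition is associative; aggregates such as median need more care when computed from partitions.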

Page 96: Pre-computed Aggregates

- Keep aggregated data for efficiency (pre-computed queries)
- Questions
  - Which aggregates to compute?
  - How to update aggregates?
  - How to use pre-computed aggregates in queries?

Page 97: Pre-computed Aggregates

- Aggregated table can be maintained by the
  - warehouse server
  - middle tier
  - client applications
- Pre-computed aggregates -- special case of materialized views -- same questions and issues remain
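
The two maintenance questions (how to update an aggregate, how to use it in queries) can be illustrated with a hand-maintained summary table. This is a sketch under assumptions: the `sales` and `sales_by_city` tables are hypothetical, and the application plays the role of whichever tier maintains the aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (city TEXT, month TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('Pune', 'Jan', 10), ('Pune', 'Feb', 20), ('Delhi', 'Jan', 5);

    -- Pre-computed aggregate: total sales per city.
    CREATE TABLE sales_by_city AS
        SELECT city, SUM(amount) AS total, COUNT(*) AS n FROM sales GROUP BY city;
""")

def add_sale(conn, city, month, amount):
    """Insert a sale and adjust the aggregate in the same transaction."""
    with conn:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (city, month, amount))
        cur = conn.execute(
            "UPDATE sales_by_city SET total = total + ?, n = n + 1 WHERE city = ?",
            (amount, city),
        )
        if cur.rowcount == 0:   # first sale for this city
            conn.execute("INSERT INTO sales_by_city VALUES (?, ?, 1)", (city, amount))

add_sale(conn, "Delhi", "Feb", 7)
add_sale(conn, "Goa", "Feb", 3)

# A per-city query can read the small summary table instead of scanning sales.
from_summary = dict(conn.execute("SELECT city, total FROM sales_by_city"))
from_base = dict(conn.execute("SELECT city, SUM(amount) FROM sales GROUP BY city"))
```

Incremental maintenance like this works for sums and counts; deciding which aggregates are worth keeping remains a separate design question.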

Page 98: SQL Extensions

- Extended family of aggregate functions
  - rank (top 10 customers)
  - percentile (top 30% of customers)
  - median, mode
  - object-relational systems allow addition of new aggregate functions
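
What these extended aggregates compute can be shown outside SQL with a few lines of Python. The customer revenue figures are hypothetical, and the functions `top_n` and `top_percent` are illustrative names rather than any product's API.

```python
import statistics

# Hypothetical revenue per customer.
revenue = {"acme": 500, "bolt": 400, "core": 100, "dyna": 100, "echo": 50}

def top_n(values, n):
    """Rank: the n keys with the largest values, best first."""
    return sorted(values, key=values.get, reverse=True)[:n]

def top_percent(values, percent):
    """Percentile: the top `percent` of keys, rounding the count up."""
    k = (len(values) * percent + 99) // 100   # integer ceiling
    return top_n(values, k)

best_two = top_n(revenue, 2)
top_30 = top_percent(revenue, 30)
median_revenue = statistics.median(revenue.values())
mode_revenue = statistics.mode(revenue.values())
```

Expressing the same questions in plain SQL of the period required self-joins or nested subqueries, which is what motivated the extensions.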

Page 99: SQL Extensions

- Reporting features
  - running total, cumulative totals
- Cube operator
  - group by on all subsets of a set of attributes (month, city)
  - redundant scan and sorting of data can be avoided
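
The following sketch shows what the cube operator produces for the attributes (month, city), along with a running total. It is deliberately naive: it rescans the rows once per attribute subset, which is exactly the redundancy the slide says a real implementation can avoid. The sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import accumulate, combinations

# Hypothetical rows: (month, city, sales).
rows = [("Jan", "Pune", 10), ("Jan", "Delhi", 5), ("Feb", "Pune", 20)]
ATTRS = ("month", "city")

def cube(rows, attrs=ATTRS):
    """Sum sales grouped by every subset of attrs, including the empty subset."""
    result = {}
    for k in range(len(attrs) + 1):
        for subset in combinations(range(len(attrs)), k):
            groups = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in subset)
                groups[key] += row[-1]
            result[tuple(attrs[i] for i in subset)] = dict(groups)
    return result

cubed = cube(rows)

# Running (cumulative) total of sales in row order.
running_total = list(accumulate(row[-1] for row in rows))
```

With two attributes the cube holds four groupings: the grand total, by month, by city, and by (month, city).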

Page 100: Technological Requirements

- Managing large amounts of data
- Managing multiple media -- storage hierarchy
  - cache (L1 and L2)
  - main memory
  - disks
  - optical disks
  - tapes
  - fiche

Page 101: Technological Requirements

- Ability to index data at will
  - temporary indices, sparse indices
- Ability to monitor data freely and easily
  - to determine whether reorganization is required
  - to determine if an index is poorly structured
  - to determine the statistical composition of data
- Need to interface to many technologies for both receiving and passing data

Page 102: Technological Requirements

- Programmer/designer control of data
- Parallel storage/management of data
- Good metadata management
- Load the warehouse efficiently
- Use indexes efficiently
- Compaction of data

Page 103: Technological Requirements

- Compound keys
- Variable-length data
- Lock management
  - need to be able to turn the lock manager on and off
- Index-only processing

Page 104: Warehouse Server Products

- Oracle 8
- Informix
  - Online Dynamic Server for SMP
  - Extended Parallel Server for MPP
  - Universal Server for object-relational applications
- Sybase
  - Adaptive Server 11.5
  - Sybase MPP
  - Sybase IQ

Page 105: Warehouse Server Products

- MS-SQL Server, OLAP server
- Red Brick Warehouse
- Tandem Nonstop
- IBM
  - DB2 MVS
  - Universal Server
  - DB2 400
- Teradata

Page 106: Server Scalability

- Scalability is the #1 IT requirement for data warehousing
- Hardware platform options
  - SMP
  - clusters (shared disk)
  - MPP
    - loosely coupled (shared nothing)
  - hybrid

Page 107: SMP Characteristics

- SMP -- symmetric multiprocessing -- shared everything
- Multiple CPUs share the same memory
- Workload is balanced across CPUs by the OS
- Scalability is limited by the bandwidth of the internal bus and by the OS architecture
- Not tolerant of failure in a processing node
- Architecture is mostly invisible to applications

Page 108: SMP Benefits

- Lower entry point -- can start with SMP
- Mature technology

Page 109: MPP Characteristics

- Each node owns a portion of the database
- Nodes are connected via an interconnection network
- Each node can be a single CPU or SMP
- Load balancing done by application
- High scalability due to local processing isolation
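
One common way to give each node its own portion of the database is hash partitioning on a key. The slides do not say which scheme any particular product uses, so the sketch below is only an illustration; `owner_node` and `partition_rows` are hypothetical names, and CRC32 is used simply because it gives the same result on every run.

```python
import zlib

def owner_node(key, num_nodes):
    """Pick the node that owns a key, using a hash that is stable across runs."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes

def partition_rows(rows, num_nodes, key_pos=0):
    """Assign every row to exactly one node based on its key column."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[owner_node(row[key_pos], num_nodes)].append(row)
    return nodes

# Hypothetical call records: (customer_id, minutes).
rows = [(cust, minutes) for cust in range(20) for minutes in (1, 2)]
nodes = partition_rows(rows, num_nodes=4)
```

Because all rows for a given key land on the same node, per-key work can be done locally without crossing the interconnect.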

Page 110: MPP Benefits

- High availability
- High scalability

Page 111: Sizing Your System

- Estimate
  - total volume of data
  - total disk throughput required
- Determine number of controllers and disks required
- Determine CPU and memory based on the workload
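
The first two sizing steps can be turned into simple arithmetic. Every number below (data volume, required throughput, per-disk and per-controller figures) is a hypothetical input chosen for the example, not a recommendation, and the model ignores redundancy, indexes, and temporary space.

```python
import math

def size_storage(volume_gb, required_mb_s, disk_gb, disk_mb_s, controller_mb_s):
    """Estimate disk and controller counts from capacity and throughput needs."""
    disks_for_capacity = math.ceil(volume_gb / disk_gb)
    disks_for_throughput = math.ceil(required_mb_s / disk_mb_s)
    return {
        # Enough disks to satisfy whichever requirement is larger.
        "disks": max(disks_for_capacity, disks_for_throughput),
        # Enough controllers to carry the required aggregate throughput.
        "controllers": math.ceil(required_mb_s / controller_mb_s),
    }

plan = size_storage(
    volume_gb=2000,       # total volume of data
    required_mb_s=400,    # total disk throughput required
    disk_gb=100,          # capacity of one disk
    disk_mb_s=10,         # sustained throughput of one disk
    controller_mb_s=80,   # throughput one controller can carry
)
```

In this example throughput, not capacity, drives the disk count: 20 disks would hold the data, but 40 are needed to deliver 400 MB/s.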

Page 112: Other Warehouse Related Products

- Connectivity to sources
  - Apertus
  - Information Builders EDA/SQL
  - Platinum Infohub
  - SAS Connect
  - IBM Data Joiner
  - Oracle Open Connect
  - Informix Express Gateway

Page 113: Other Warehouse Related Products

- Data extract, clean, transform, refresh
  - CA-Ingres replicator
  - Carleton Passport
  - Prism Warehouse Manager
  - SAS Access
  - Sybase Replication Server
  - Platinum Inforefiner, Infopump

Page 114: Other Warehouse Related Products

- Multidimensional database engines
  - Arbor Essbase
  - Oracle IRI Express
  - SAS System
- ROLAP servers
  - HP Intelligent Warehouse
  - Informix metacube
  - MicroStrategy DSS server

Page 115: Other Warehouse Related Products

- Query/reporting environments
  - Brio/Query
  - Cognos Impromptu
  - Informix Viewpoint
  - CA Visual Express
  - Business Objects
  - Platinum Forest and Trees

Page 116: Other Warehouse Related Products

- Multidimensional analysis
  - Andyne Pablo
  - Arbor Essbase Analysis Server
  - Cognos Powerplay
  - Holistic Systems (HOLOS)
  - Microstrategy DSS
  - SAS OLAP++

Page 117: Building a Successful Data Warehouse

Page 118: From The Standish Group

TSG: How many Data Warehouses do you have?

Data Warehouser: We have had eight.

TSG: To what do you attribute so many warehouses?

Data Warehouser: Seven mistakes ...

Page 119: For a Successful Warehouse

- From day one establish that warehousing is a joint user/builder project
- Establish that maintaining data quality will be an ONGOING joint user/builder responsibility
- Train the users one step at a time
- Consider doing a high level corporate data model in no more than three weeks

From Larry Greenfield, http://pwp.starnetinc.com/larryg/index.html

Page 120: For a Successful Warehouse

- Look closely at the data extracting, cleaning, and loading tools
- Implement a user accessible automated directory to information stored in the warehouse
- Determine a plan to test the integrity of the data in the warehouse
- From the start get warehouse users in the habit of 'testing' complex queries

Page 121: For a Successful Warehouse

- Coordinate system roll-out with network administration personnel
- When in a bind, ask others who have done the same thing for advice
- Be on the lookout for small, but strategic, projects
- Market and sell your data warehousing systems

Page 122: Data Warehouse Pitfalls

- You are going to spend much time extracting, cleaning, and loading data
- Despite best efforts at project management, data warehousing project scope will increase
- You are going to find problems with systems feeding the data warehouse
- You will find the need to store data not being captured by any existing system
- You will need to validate data not being validated by transaction processing systems

Page 123: Data Warehouse Pitfalls

- Some transaction processing systems feeding the warehousing system will not contain detail
- Many warehouse end users will be trained and never or seldom apply their training
- After end users receive query and report tools, requests for IS written reports may increase
- Your warehouse users will develop conflicting business rules
- Large scale data warehousing can become an exercise in data homogenizing

Page 124: Data Warehouse Pitfalls

- 'Overhead' can eat up great amounts of disk space
- The time it takes to load the warehouse will expand to the amount of time in the available window... and then some
- Assigning security cannot be done with a transaction processing system mindset
- You are building a HIGH maintenance system
- You will fail if you concentrate on resource optimization to the neglect of project, data, and customer management issues and an understanding of what adds value to the customer

Page 125: Thank You