data warehouses 1. what is a data warehouse? a multi-dimensional data model data warehouse...

Post on 18-Jan-2016

229 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1

Data MiningData Warehouses

2

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

From data warehousing to data mining

Data Warehousing and OLAP Technology: An Overview

3

Defined in many different ways, but not rigorously.

◦ A decision support database that is maintained separately

from the organization’s operational database

◦ Support information processing by providing a solid

platform of consolidated, historical data for analysis.

“A data warehouse is a subject-oriented, integrated, time-

variant, and nonvolatile collection of data in support of

management’s decision-making process.”—W. H. Inmon

Data warehousing:

◦ The process of constructing and using data warehouses

What is Data Warehouse?

4

Organized around major subjects, such as

customer, product, sales

Focusing on the modeling and analysis of data for

decision makers, not on daily operations or

transaction processing

Provide a simple and concise view around

particular subject issues by excluding data that

are not useful in the decision support process

Data Warehouse—Subject-Oriented

5

Constructed by integrating multiple, heterogeneous data sources◦ relational databases, flat files, on-line

transaction records Data cleaning and data integration techniques

are applied.◦ Ensure consistency in naming conventions,

encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered, etc.

◦ When data is moved to the warehouse, it is converted.

Data Warehouse—Integrated

6

The time horizon for the data warehouse is significantly longer than that of operational systems

◦ Operational database: current value data

◦ Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse

◦ Contains an element of time, explicitly or implicitly

◦ But the key of operational data may or may not contain “time element”

Data Warehouse—Time Variant

7

A physically separate store of data transformed

from the operational environment

Operational update of data does not occur in the

data warehouse environment

◦ Does not require transaction processing,

recovery, and concurrency control mechanisms

◦ Requires only two operations in data accessing: initial loading of data and access of data

Data Warehouse—Nonvolatile

8

OLTP (on-line transaction processing)◦ Major task of traditional relational DBMS◦ Day-to-day operations: purchasing, inventory, banking,

manufacturing, payroll, registration, accounting, etc. OLAP (on-line analytical processing)

◦ Major task of data warehouse system◦ Data analysis and decision making

Distinct features (OLTP vs. OLAP):◦ User and system orientation: customer vs. market◦ Data contents: current, detailed vs. historical, consolidated◦ Database design: ER + application vs. star + subject◦ View: current, local vs. evolutionary, integrated◦ Access patterns: update vs. read-only but complex queries

Data Warehouse vs. Operational DBMS

9

OLTP vs. OLAP

OLTP OLAP

function day to day operations decision support

DB design application-oriented subject-oriented

data current, up-to-date detailed, flat relational isolated

historical, summarized, multidimensional integrated, consolidated

access read/write index/hash on prim. key

lots of scans

unit of work short, simple transaction complex query

# records accessed tens millions

#users thousands hundreds

DB size 100MB-GB 100GB-TB

10

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

From data warehousing to data mining

Data Warehousing and OLAP Technology: An Overview

11

A data warehouse is based on a multidimensional data

model which views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and

viewed in multiple dimensions

◦ Dimension tables, such as item (item_name, brand, type),

or time(day, week, month, quarter, year)

◦ Fact table contains measures (such as dollars_sold) and

keys to each of the related dimension tables

From Tables and Spreadsheets to Data Cubes

12

Modeling data warehouses: dimensions &

measures

◦ Star schema: A fact table in the middle

connected to a set of dimension tables

◦ Snowflake schema: A refinement of star schema

where some dimensional hierarchy is normalized

into a set of smaller dimension tables, forming a

shape similar to snowflake

Conceptual Modeling of Data Warehouses

13

Example of Star Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcitystate_or_provincecountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

14

Example of Star Schema

15

Example of Snowflake Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcity_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_key

item

branch_keybranch_namebranch_type

branch

supplier_keysupplier_type

supplier

city_keycitystate_or_provincecountry

city

Example of Snowflake Schema

17

Allows data to be modeled and viewed in multiple dimensions

Another representation like star schema Dimensions are entities which you want to

keep records◦ Time, item, branch, location…

Each dimension has a table called dimension table

Data Cubes

18

3-D Data Cube

• 4-D cubes can be visualized as a series of 3-D cubesSupplier 1 Supplier 2 Supplier 3

19

A data cube is often referred as a cuboid Can generate a cubiod for all possible

subsets of dimensions◦ Provide different level of summarization

N-D cube base cuboid◦ Lowest level of summary

0-D cube apex cuboid◦ Highest level of summary◦ Summary over all dimensions

Cuboid

20

Cube: A Lattice of Cuboids

time,item

time,item,location

time, item, location, supplier

all

time item location supplier

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,supplier

time,location,supplier

item,location,supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

21

Start with time-product (2-D) table that shows sale amounts

Ex. Data Cube Gen

TV PC VCR

1st Qtr 1000 850 350

2nd Qtr 1352 940 298

3rd Qtr 1450 658 314

4th Qtr 1500 965 365

USA

22

TV PC VCR TV PC VCR TV PC VCR

1st Q 1000 850 350 2600 750 425 1300 850 350

2nd Q 1352 940 298 1752 860 236 1200 1000 400

3rd Q 1450 658 314 1055 458 520 1150 555 510

4th Q 1500 965 365 1350 1065 390 900 750 425

Ex. Data Cube Gen (3D)

USA Canada Mexico

23

A Sample Data Cube

Total annual salesof TV in U.S.A.Date

Produ

ct

Cou

ntr

ysum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

24

Cuboids Corresponding to the Cube

all

product date country

product,date product,country date, country

product, date, country

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D(base) cuboid

25

Sales volume as a function of product, month, and region

Multidimensional DataP

rodu

ctReg

ion

Month

Dimensions: Product, Location, TimeHierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

26

A Concept Hierarchy: Dimension (location)

all

Europe North_America

MexicoCanadaSpainGermany

Vancouver

M. WindL. Chan

...

......

... ...

...

all

region

office

country

TorontoFrankfurtcity

27

Roll up (drill-up): summarize data◦ by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up◦ from higher level summary to lower level summary

or detailed data, or introducing new dimensions Slice and dice:

◦ project and select Pivot (rotate):

◦ reorient the cube, visualization, 3D to series of 2D planes

Typical OLAP Operations

28

29

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

From data warehousing to data mining

Chapter 3: Data Warehousing and OLAP Technology: An Overview

Data Warehouse: A Multi-Tiered Architecture

DataWarehouse

ExtractTransformLoadRefresh

OLAP Engine

AnalysisQueryReportsData mining

Monitor&

IntegratorMetadata

Data Sources Front-End Tools

Serve

Data Marts

Operational DBs

Othersources

Data Storage

OLAP Server

31

Data Mart◦ a subset of corporate-wide data that is of value to a specific

groups of users. Its scope is confined to specific, selected groups, such as marketing data mart

Meta data is the data defining warehouse objects. It stores:◦ Description of the structure of the data warehouse

schema, view, dimensions, hierarchies, data mart locations and contents

◦ Monitoring information warehouse usage statistics, error reports

◦ Algorithms used for summarization

Warehouse Architecture

32

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

From data warehousing to data mining

Chapter 3: Data Warehousing and OLAP Technology: An Overview

33

Three kinds of data warehouse applications

◦ Information processing supports querying, basic statistical analysis, and

reporting using crosstabs, tables, charts and graphs

◦ Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling,

pivoting

◦ Data mining knowledge discovery from hidden patterns supports associations, constructing analytical models,

performing classification and prediction, and presenting the mining results using visualization tools

Data Warehouse Usage

34

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

From data warehousing to data mining

Summary

Chapter 3: Data Warehousing and OLAP Technology: An Overview

35

Why data warehousing? A multi-dimensional model of a data warehouse

◦ Star schema, snowflake schema

◦ A data cube consists of dimensions & measures OLAP operations: drilling, rolling, slicing, dicing and pivoting Data warehouse architecture

Summary: Data Warehouse and OLAP Technology

top related