data warehousing and olapcmpt 354: database i -- data warehousing and olap 4 data analysis and olap...

29
Data Warehousing and OLAP

Upload: others

Post on 16-Apr-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

Data Warehousing and OLAP

Page 2: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 2

Decision Support Systems

• Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems

• Examples of business decisions:– What items to stock?– What insurance premium to change?– To whom to send advertisements?

• Examples of data used for making decisions– Retail sales transaction details– Customer profiles (income, age, gender, etc.)

Page 3: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 3

Data and Statistical Analysis

• Data analysis tasks are simplified by specialized tools and SQL extensions– Example tasks

• For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year

• As above, for each product category and each customer category

• Statistical analysis packages (e.g., SAS) can be interfaced with databases

Page 4: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 4

Data Analysis and OLAP

• Online Analytical Processing (OLAP)– Interactive analysis of data, allowing data to be

summarized and viewed in different ways in an online fashion (with negligible delay)

• Multidimensional data: data modeled as dimension attributes and measure attributes – Dimension attributes: define the dimensions on which

measure attributes (or aggregates thereof) are viewed, e.g. the attributes item_name, color, and size of the sales relation

– Measure attributes: can be aggregated upon, e.g., the attribute number of the sales relation

Page 5: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 5

Pivot Table• Values for one of the dimension attributes form the row headers• Values for another dimension attribute form the column headers• Other dimension attributes are listed on top• Values in individual cells are (aggregates of) the values of the

dimension attributes that specify the cell

Page 6: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 6

Relational Representation

• Cross-tabs can be represented as relations– The value all is used to

represent aggregates– All represents a set– The SQL:1999

standard uses null values in place of alldespite confusion with regular null values

Page 7: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 7

Data Cubes

• A data cube is a multidimensional generalization of a cross-tab

• Can have n dimensions• Cross-tabs can be used as views on a data

cube

Page 8: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 8

Online Analytical Processing

• Pivoting: changing the dimensions used in a cross-tab is called

• Slicing: creating a cross-tab for fixed values only– Sometimes called dicing, particularly when values for

multiple dimensions are fixed• Rollup: moving from finer-granularity data to a

coarser granularity • Drill down: The opposite operation - that of

moving from coarser-granularity data to finer-granularity data

Page 9: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 9

Hierarchies on Dimensions

• Enable dimensions be viewed at different levels of detail– Dimension DateTime can be used to aggregate by hour

of day, date, day of week, month, quarter or year

Page 10: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 10

Cross Tabulation With Hierarchy

Page 11: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 11

OLAP Implementation

• Multidimensional OLAP (MOLAP) systems– Multidimensional arrays in memory to store data

cubes• Relational OLAP (ROLAP) systems

– Relational tables to store data cubes• Hybrid OLAP (HOLAP) systems

– Store some summaries in memory and store the base data and other summaries in a relational database

Page 12: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 12

Extended Aggregation in SQL:1999• The cube operation computes union of group by’s

on every subset of the specified attributesselect item-name, color, size, sum(number)from salesgroup by cube(item-name, color, size)

• Compute the union of eight different groupings of the sales relation: { (item-name, color, size), (item-name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) }

• For each grouping, the result contains the null value for attributes not present in the grouping

Page 13: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 13

OLTP Versus OLAPOLTP OLAP

users clerk, IT professional knowledge workerfunction day to day operations decision supportDB design application-oriented subject-orienteddata current, up-to-date, detailed, flat

relational Isolatedhistorical, summarized, multidimensionalintegrated, consolidated

usage repetitive ad-hocaccess read/write, index/hash on prim.

keylots of scans

unit of work short, simple transaction complex query# records accessed

tens millions

#users thousands hundredsDB size 100MB-GB 100GB-TBmetric transaction throughput query throughput, response

Page 14: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 14

What Is a Data Warehouse?

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”

– W. H. Inmon• Data warehousing: the process of

constructing and using data warehouses

Page 15: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 15

Subject-Oriented

• Organized around major subjects, such as customer, product, sales

• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

• Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Page 16: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 16

Integrated

• Integrating multiple, heterogeneous data sources– Relational databases, flat files, on-line transaction

records• Data cleaning and data integration

– Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

• E.g., Hotel price: currency, tax, breakfast covered, etc.

– When data is moved to the warehouse, it is converted

Page 17: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 17

Time Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems– Operational database: current value data– Data warehouse data: provide information from a

historical perspective (e.g., past 5-10 years)• Every key structure in the data warehouse

– Contains an element of time, explicitly or implicitly– But the key of operational data may or may not contain

“time element”

Page 18: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 18

Nonvolatile

• A physically separate store of data transformed from the operational environment

• Operational update of data does not occur in the data warehouse environment– Does not require transaction processing,

recovery, and concurrency control mechanisms– Requires only two operations in data accessing:

• initial loading of data and access of data

Page 19: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 19

Data Warehousing

Page 20: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 20

Collecting Data• Source driven architecture: data sources transmit

new information to a warehouse, either continuously or periodically (e.g. at night)

• Destination driven architecture: a warehouse periodically requests new information from data sources

• Keeping warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive– Usually OK to have slightly out-of-date data at

warehouse– Data/updates are periodically downloaded form online

transaction processing (OLTP) systems

Page 21: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 21

Design Issues

• Data cleansing– Correct mistakes in addresses (misspellings, zip code

errors), and merge address lists from different sources and purge duplicates

• Update propagating– Warehouse schema may be a (materialized) view of

schema from data sources• Summarizing data

– Raw data may be too large to store on-line– Aggregate values (totals/subtotals) often suffice– Queries on raw data can often be transformed by query

optimizer to use aggregate values

Page 22: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 22

Warehouse Schemas

• Dimension values are usually encoded using small integers and mapped to full values via dimension tables

• Resultant schema is called a star schema– More complicated schema structures

• Snowflake schema: multiple levels of dimension tables

• Constellation: multiple fact tables

Page 23: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 23

Data Warehouse Schema

Page 24: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 24

Snowflake Schema

A star schema is a snowflake schema where each dimension has only one single dimension table

Picture from publib.boulder.ibm.com

Page 25: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 25

Why Data Mining?

• Evolution of database technology– To collect a large amount of data primitive

file processing– To store and query data efficiently DBMS

• New challenges: huge amount of data, how to analyze and understand?– Data mining

Page 26: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 26

What Is Data Mining?

• Mining data – mining knowledge• Data mining is the non-trivial process of

identifying valid, novel, potentially useful, and ultimately understandable patterns in data

Page 27: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 27

The KDD Process

Data

Target data

Preprocesseddata

Transformeddata

Patterns

Knowledge

Selection Preprocessing

Transformation

Data mining

Interpretation/evaluation

Page 28: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 28

KDD Process Steps

• Preprocessing– Data cleaning – Data integration

• Data selection• Data transformation• Data mining• Pattern evaluation• Knowledge presentation

Page 29: Data Warehousing and OLAPCMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing

CMPT 354: Database I -- Data Warehousing and OLAP 29

Summary

• OLAP• Data warehousing

– Star schema– Snowflake schema

• Data mining and KDD process