
7 October 2003 1

CSCI6405 Fall 2003
Data Mining and Data Warehousing

Instructor: Qigang Gao, Office: CS219, Tel: 494-3356, Email: [email protected]
Assistant: Christopher Jordan, Email: [email protected]
Hours: TR, 1:30 - 3:00 PM


Lectures Outline

Part I: Overview on DM and DW
1. Introduction (Ch1)    Ass1 Due: Sep 23 Tue
2. Data preprocessing (Ch3)

Part II: DW and OLAP
3. Data warehousing and OLAP (Ch2)    Ass2: Sep 23 – Oct 14

Part III: Data Mining Methods/Algorithms
4. Data mining primitives (Ch4)
5. Classification data mining (Ch7)    Ass3: Oct 7 – Oct 21
6. Association data mining (Ch6)    Ass4: Oct 21 – Nov 5
7. Characterization data mining (Ch5)
8. Clustering data mining (Ch8)

Part IV: Mining Complex Types of Data
9. Mining the Web (Ch9)
10. Mining spatial data (Ch9)

Project Presentations
Project Due: Dec 8


Reservation of the LCD Lab:

Wed: 8:30 am – 2:00 pm
Sat: 12:00 pm – 6:00 pm
Sun: 12:00 pm – 6:00 pm


2. DATA WAREHOUSING AND OLAP (Ch2)

Objectives of DW/OLAP
What is a DW?
Multidimensional Data Model
DW Schemas
Aggregations
OLAP Operations
DW Architecture
From data warehousing to data mining


How to define DW schema: a data mining query language: DMQL

Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>

Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)
    First time as “cube definition”
    define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>


Example of Star Schema

Sales Fact Table
    Keys: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales

Dimension tables
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city, state_or_province, country


Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
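As a rough illustration of what the DMQL star schema definition describes, the sketch below builds a tiny star schema in an in-memory SQLite database and computes the three measures for one grouping. The table names `time_dim` and `location_dim`, the sample rows, and the helper names `build_star` and `measures_by_quarter_city` are assumptions made for this example; only two of the four dimension tables are populated to keep it short.

```python
import sqlite3

def build_star():
    """Create a tiny in-memory star schema (illustrative data only)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                               day_of_week TEXT, month INTEGER,
                               quarter TEXT, year INTEGER);
        CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
                                   city TEXT, province_or_state TEXT, country TEXT);
        -- Fact table: one row per sale, with foreign keys into the dimensions.
        CREATE TABLE sales (time_key INTEGER REFERENCES time_dim(time_key),
                            location_key INTEGER REFERENCES location_dim(location_key),
                            sales_in_dollars REAL);
    """)
    con.executemany("INSERT INTO time_dim VALUES (?,?,?,?,?,?)", [
        (1, 15, "Wed", 1, "Q1", 2003),
        (2, 15, "Tue", 4, "Q2", 2003),
    ])
    con.executemany("INSERT INTO location_dim VALUES (?,?,?,?,?)", [
        (1, "1 Main St", "Vancouver", "BC", "Canada"),
        (2, "2 King St", "Halifax", "NS", "Canada"),
    ])
    con.executemany("INSERT INTO sales VALUES (?,?,?)", [
        (1, 1, 100.0), (1, 1, 300.0), (1, 2, 50.0), (2, 1, 200.0),
    ])
    return con

def measures_by_quarter_city(con):
    """Compute dollars_sold, avg_sales, units_sold per (quarter, city)."""
    return con.execute("""
        SELECT t.quarter, l.city,
               SUM(s.sales_in_dollars) AS dollars_sold,
               AVG(s.sales_in_dollars) AS avg_sales,
               COUNT(*)                AS units_sold
        FROM sales s
        JOIN time_dim t     ON s.time_key = t.time_key
        JOIN location_dim l ON s.location_key = l.location_key
        GROUP BY t.quarter, l.city
        ORDER BY t.quarter, l.city
    """).fetchall()

rows = measures_by_quarter_city(build_star())
```

Each dimension is reached with a single join from the fact table, which is the defining property of the star layout.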


Example of Snowflake Schema


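Sales Fact Table
    Keys: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales

Dimension tables
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_key
        supplier: supplier_key, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city_key
        city: city_key, city, state_or_province, country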


Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city(city_key, province_or_state, country))


Example of Fact Constellation


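Sales Fact Table
    Keys: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
    Keys: time_key, item_key, shipper_key, from_location, to_location
    Measures: dollars_cost, units_shipped

Dimension tables
    time: time_key, day, day_of_the_week, month, quarter, year (shared by both fact tables)
    item: item_key, item_name, brand, type, supplier_type (shared by both fact tables)
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city, province_or_state, country (shared by both fact tables)
    shipper: shipper_key, shipper_name, location_key, shipper_type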


Defining a Fact Constellation in DMQL

define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales


How are hierarchical data materialized in a data warehouse?


Aggregations

- To measure a business event

What do I want to look at? What am I trying to compare?

* define a grouping (i.e., determine a cuboid of the data cube),
* measure the fact about the event (i.e., the cuboid)
    | retrieve a pre-calculated value, or invoke an aggregate function

* OLAP query: dimension-value pairs.

E.g., dimension: <time="Q1", location="Vancouver", item="Computer">
      value (measured): sales = sum(the data set).

A measure value is computed for a defined cuboid by aggregating the data corresponding to the respective dimension-value pairs defining the given event.
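A minimal sketch of this idea in Python: a cell is given as dimension-value pairs, and its measure is obtained by aggregating the matching facts. The fact rows, the field name `sales_in_dollars`, and the helper `measure_cell` are hypothetical and chosen only for illustration.

```python
def measure_cell(facts, cell, value_field="sales_in_dollars"):
    """Sum `value_field` over the facts matching every dimension-value pair in `cell`."""
    return sum(
        row[value_field]
        for row in facts
        if all(row[dim] == val for dim, val in cell.items())
    )

# Made-up detail data (the "data set" being aggregated).
facts = [
    {"time": "Q1", "location": "Vancouver", "item": "Computer", "sales_in_dollars": 1200.0},
    {"time": "Q1", "location": "Vancouver", "item": "Computer", "sales_in_dollars": 800.0},
    {"time": "Q1", "location": "Toronto", "item": "Computer", "sales_in_dollars": 500.0},
    {"time": "Q2", "location": "Vancouver", "item": "Phone", "sales_in_dollars": 300.0},
]

# The OLAP query from the slide, expressed as dimension-value pairs.
cell = {"time": "Q1", "location": "Vancouver", "item": "Computer"}
sales = measure_cell(facts, cell)
```

Leaving a dimension out of `cell` aggregates over all of its values, which corresponds to a cell in a lower-dimensional cuboid; an empty `cell` gives the single value of the apex cuboid.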


Measures: Three Categories

* Distributive functions: An aggregate function is distributive if, when a set is divided into n subsets and the function is applied to each subset, combining the n partial results gives the same result as applying the function to the whole set.
  E.g., count(), sum(), min(), max().

* Algebraic functions: An aggregate function is algebraic if it can be calculated by an algebraic function with M arguments, where each argument is a distributive aggregate function.
  E.g., avg() = sum() / count(), standard_deviation(), ...

* Holistic functions: An aggregate function is holistic if it characterizes one or more elements of a set relative to the other elements of the set and cannot be obtained by an algebraic calculation over partial results.
  E.g., rank(), median(), ...

Distributive and algebraic aggregate functions are the most frequently used and can be calculated efficiently. In contrast, holistic aggregate functions cannot in general be calculated efficiently, so they are generally not used in data warehouses.
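The three categories can be contrasted on a small example. The sketch below is an illustration under assumed data: the numbers and the helper names `summarize` and `combine` are made up. It splits a data set into subsets, merges distributive summaries, derives an algebraic measure from them, and shows why the median cannot be obtained the same way.

```python
import statistics

def summarize(chunk):
    """Distributive summaries of one subset: (sum, count, min, max)."""
    return (sum(chunk), len(chunk), min(chunk), max(chunk))

def combine(partials):
    """Merge per-subset summaries without revisiting the raw data."""
    sums, counts, mins, maxs = zip(*partials)
    return (sum(sums), sum(counts), min(mins), max(maxs))

data = [4, 8, 15, 16, 23, 42, 7]
subsets = [data[:3], data[3:5], data[5:]]

# Distributive: the merged partial results equal the results on the whole set.
total, count, lo, hi = combine([summarize(s) for s in subsets])

# Algebraic: avg() is derived from two distributive results.
avg = total / count

# Holistic: the median of the subset medians generally differs from the
# median of the whole set, so the full data set is needed.
median_of_medians = statistics.median(statistics.median(s) for s in subsets)
true_median = statistics.median(data)
```

Note that merging counts means summing them, not counting them again; "combining" is specific to each function.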


Pre-aggregation vs. On-line aggregation

Pre-aggregation: all needed calculations are done by batch process.

On-line aggregation: the aggregating computation is done on-line, at query time. The main issue is that the data volume to be aggregated is normally very large. On-line aggregation results in real time aggravation.

The manager's rule of thumb:
- An average aggregation should get a response from the data warehousing system in 20 seconds or less.
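The trade-off can be sketched in a few lines of Python. This is only an illustration with made-up fact rows and hypothetical helper names (`online_sum`, `precompute`): the on-line version scans every fact at query time, while the pre-aggregated version pays that cost once in a batch pass and answers queries by lookup.

```python
from collections import defaultdict

# Made-up facts: (quarter, city, sales_in_dollars).
facts = [("Q1", "Vancouver", 100.0), ("Q1", "Halifax", 50.0),
         ("Q2", "Vancouver", 200.0), ("Q1", "Vancouver", 300.0)]

def online_sum(facts, quarter, city):
    """On-line aggregation: scan all facts at query time."""
    return sum(v for q, c, v in facts if q == quarter and c == city)

def precompute(facts):
    """Pre-aggregation: one batch pass builds a lookup table of sums."""
    table = defaultdict(float)
    for q, c, v in facts:
        table[(q, c)] += v
    return dict(table)

summary = precompute(facts)          # done ahead of time by a batch process
fast = summary[("Q1", "Vancouver")]  # query time: a dictionary lookup
slow = online_sum(facts, "Q1", "Vancouver")  # query time: a full scan
```

Both paths return the same value; they differ in when the work is done and in how much storage the pre-computed table needs.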


Efficient Data Cube Computation

Data cube can be viewed as a lattice of cuboids

The bottom-most cuboid is the base cuboid

The top-most cuboid (apex) contains only one cell

How many cuboids are there in an n-dimensional cube with L levels?

    T = \prod_{i=1}^{n} (L_i + 1)

where L_i is the number of levels of dimension i.

E.g., the cube has 10 dimensions and 4 levels for each dimension:

    5^10 ≈ 9.8 x 10^6.

Materialization of data cube

Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)

Selection of which cuboids to materialize

Based on size, sharing, access frequency, etc.
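The slide's count can be checked with a short Python sketch. It assumes the usual formula, in which the total number of cuboids is the product over all dimensions of (L_i + 1), with L_i the number of levels of dimension i and the extra 1 standing for the virtual top level "all". The function name `total_cuboids` is made up for this example.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Product over dimensions of (L_i + 1); the +1 is the virtual level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)

# The slide's example: 10 dimensions with 4 levels each.
slide_example = total_cuboids([4] * 10)

# Four dimensions without hierarchies (one level each) give 2^4 cuboids.
no_hierarchies = total_cuboids([1] * 4)
```

With nearly ten million cuboids in the first case, materializing all of them is rarely practical, which motivates partial materialization.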


Cube: A Lattice of Cuboids

0-D (apex) cuboid: all

1-D cuboids: time; item; location; supplier

2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier

3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier

4-D (base) cuboid: time, item, location, supplier
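The lattice for these four dimensions can be enumerated mechanically: each cuboid is a subset of the dimensions. The following sketch uses `itertools.combinations`; the function name `cuboid_lattice` is an assumption made for this example.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return {k: list of k-D cuboids}, each cuboid a tuple of dimension names."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

lattice = cuboid_lattice(["time", "item", "location", "supplier"])

# Print one line per level; the empty tuple is the apex cuboid "all".
for k, cuboids in lattice.items():
    print(f"{k}-D:", [",".join(c) or "all" for c in cuboids])
```

The level sizes follow the binomial coefficients, so n dimensions without hierarchies yield 2^n cuboids in total.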


OLAP Operations

Roll up (drill-up): summarize data
    by climbing up a hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up
    from higher level summary to lower level summary or detailed data, or introducing new dimensions

Slice and dice: project and select

Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.

Other operations
    drill across: involving (across) more than one fact table, etc.
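The operations above can be sketched on a toy cube held in a Python dictionary. Everything here is illustrative: the cell values, the `CITY_TO_COUNTRY` hierarchy, and the function names are assumptions, and a real OLAP server would work on far larger, indexed structures.

```python
from collections import defaultdict

# Base cuboid cells: (quarter, city, item) -> dollars_sold. Values are made up.
cube = {
    ("Q1", "Vancouver", "Computer"): 400.0,
    ("Q1", "Toronto", "Computer"): 150.0,
    ("Q1", "Vancouver", "Phone"): 100.0,
    ("Q2", "Vancouver", "Computer"): 200.0,
}
DIMS = ("time", "location", "item")
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada"}

def roll_up_drop(cube, dim):
    """Roll up by dimension reduction: sum out one dimension."""
    i = DIMS.index(dim)
    out = defaultdict(float)
    for key, value in cube.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)

def roll_up_location(cube):
    """Roll up by climbing the location hierarchy: city -> country."""
    out = defaultdict(float)
    for (t, city, item), value in cube.items():
        out[(t, CITY_TO_COUNTRY[city], item)] += value
    return dict(out)

def slice_(cube, dim, value):
    """Slice: select a single value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice(cube, allowed):
    """Dice: select on several dimensions; `allowed` maps dim -> set of values."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

def pivot(cube, order):
    """Pivot: reorder the dimensions of every cell key."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}
```

Drill-down is not shown as a function because it cannot be computed from a rolled-up result: it requires going back to the more detailed cuboid, here `cube` itself.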


A Star-Net Query Model

[Figure: star-net with one radial line per dimension; each circle on a line marks an abstraction level and is called a footprint]

Shipping Method: AIR-EXPRESS, TRUCK
Customer Orders: ORDER, CONTRACTS
Customer
Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM
Organization: SALES PERSON, DISTRICT, DIVISION
Promotion
Location: CITY, COUNTRY, REGION
Time: DAILY, QTRLY, ANNUALLY


Example of data warehousing using MS SQL server 2000


Drill down to see product categories.

Drill down to see product “Clams” sales information.


DW Development Procedure

1. Choose a business process to model, understand the complexity of data, determine a data schema to use, etc.

2. Decide subject(s), choose the measures that will populate each fact table record.

3. Choose fact table: the grain and measures of the subject. The grain is the fundamental, atomic level of data to be represented in the fact table, such as daily or weekly sales.

4. Choose the dimensions that will apply to each fact table record.


Data Warehouse Development: An Incremental Approach

[Figure: incremental development. Starting point: "Define a high-level corporate data model". Boxes: Data Mart, Data Mart, Enterprise Data Warehouse (with model refinement), Distributed Data Marts, Multi-Tier Data Warehouse.]


Data Warehouse Architecture

The architecture of data:

Abstraction level
    Business rules
        |
    Metadata
        |
    Schema
        |
    Summary data
        |
    Operational data

The abstraction hierarchy of data and its description helps users navigate around a data warehouse. As data gets more abstract, it generally gets less voluminous.


The architecture of data (cont)

- Operational data: who, what, where, and when

- Summary data: summaries by who, what, where, and when

- Schema: physical layout of the data, tables, fields, indexes, types

- Metadata: logical model and mappings to physical layout and sources (by defining the data in business terms)

- Business rules: what's been learned from the data


Multi-tier architecture:

• Client site: The end user can query and visualize data on the local computer or connect to a display server that has access to the DW.

• Middle server: Logically, OLAP engines present the users with multidimensional data from DWs or data marts. However, the physical architecture implementation issues must be considered for OLAP engines.

• DW server: Data warehouse generated from relational or operational databases; gateways for extraction and integration of multiple data sources: ODBC (Open Database Connectivity), OLE DB (Object Linking and Embedding, Database), JDBC (Java Database Connectivity), etc.


Multi-Tiered Architecture

[Figure: four tiers, from data sources to front-end tools]

Data Sources: Operational DBs, other sources
    Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata)
Data Storage: Data Warehouse, Data Marts
    Serve
OLAP Engine: OLAP Server
Front-End Tools: Analysis, Query, Reports, Data mining


Data Warehouse Back-End Tools and Utilities

Data extraction: get data from multiple, heterogeneous, and external sources

Data cleaning: detect errors in the data and rectify them when possible

Data transformation: convert data from legacy or host format to warehouse format

Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

Refresh: propagate the updates from the data sources to the warehouse
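The five back-end steps can be pictured as a small pipeline. The sketch below is a toy under stated assumptions: the record fields (`city`, `amount`), the cleaning rules, and the dictionary used as a "warehouse" are invented for illustration and do not correspond to any particular tool.

```python
def extract(sources):
    """Extraction: gather records from several heterogeneous sources."""
    return [rec for src in sources for rec in src]

def clean(records):
    """Cleaning: drop records with missing amounts, normalize city names."""
    return [{**r, "city": r["city"].strip().title()}
            for r in records if r.get("amount") is not None]

def transform(records):
    """Transformation: convert source fields to the warehouse format."""
    return [{"city": r["city"], "dollars_sold": float(r["amount"])} for r in records]

def load(warehouse, records):
    """Load: consolidate into per-city totals (a summarized view)."""
    for r in records:
        warehouse[r["city"]] = warehouse.get(r["city"], 0.0) + r["dollars_sold"]
    return warehouse

def refresh(warehouse, new_source):
    """Refresh: propagate new source records into the warehouse."""
    return load(warehouse, transform(clean(extract([new_source]))))

# Two made-up sources with inconsistent formats and one bad record.
legacy = [{"city": " vancouver ", "amount": "100"}, {"city": "Halifax", "amount": None}]
web = [{"city": "HALIFAX", "amount": 50}]

warehouse = load({}, transform(clean(extract([legacy, web]))))
warehouse = refresh(warehouse, [{"city": "vancouver", "amount": "25.5"}])
```

Keeping each step a separate function mirrors the slide's separation of concerns; the refresh step simply reruns the same pipeline on the new data.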


OLAP Server Architectures

Multidimensional OLAP (MOLAP)
    Implemented as a large multidimensional array
    Fast indexing to pre-computed summarized data (with built-in indexing)
    Not proven to scale effectively to large, high-dimensionality data sets

Relational OLAP (ROLAP)
    Implemented as a collection of relational tables
    Can be processed and queried with traditional RDBMS technology (i.e., indexes, joins, etc.)
    Greater scalability
    No “built-in” indexing

Hybrid OLAP (HOLAP)
    User flexibility, e.g., low level: relational, high level: array
    E.g., MS SQL Server 2000

E.g., the same data stored in a multidimensional array for MOLAP, and in multiple tables for ROLAP (the distributed sheet).
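To make the contrast concrete, the sketch below stores the same made-up cells twice: as rows of a relational fact table (ROLAP style) and as a dense two-dimensional array addressed by position (MOLAP style). The member lists, values, and function names are assumptions for illustration only.

```python
# Dimension members and their array positions (MOLAP-style addressing).
QUARTERS = ["Q1", "Q2"]
CITIES = ["Vancouver", "Toronto"]

# ROLAP: a relational fact table, one row per non-empty cell.
rolap_rows = [
    ("Q1", "Vancouver", 400.0),
    ("Q1", "Toronto", 150.0),
    ("Q2", "Vancouver", 200.0),
]

def to_molap(rows):
    """Build a dense 2-D array indexed by [quarter][city]; empty cells hold 0.0."""
    array = [[0.0] * len(CITIES) for _ in QUARTERS]
    for quarter, city, value in rows:
        array[QUARTERS.index(quarter)][CITIES.index(city)] = value
    return array

def rolap_lookup(rows, quarter, city):
    """ROLAP-style lookup: scan the table (an RDBMS would use indexes and joins)."""
    return sum(v for q, c, v in rows if q == quarter and c == city)

def molap_lookup(array, quarter, city):
    """MOLAP-style lookup: compute the array offsets and read the cell directly."""
    return array[QUARTERS.index(quarter)][CITIES.index(city)]

molap = to_molap(rolap_rows)
```

The array also allocates the empty cell (Q2, Toronto). With many dimensions most cells are empty, which is one reason dense arrays are hard to scale to large, high-dimensionality data.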


From On-Line Analytical Processing to On Line Analytical Mining (OLAM)

Why online analytical mining?
    High quality of data in data warehouses
        DW contains integrated, consistent, cleaned data
    Available information processing structure surrounding data warehouses
        ODBC (Open Data Base Connectivity), Web accessing, service facilities, reporting and OLAP tools
    OLAP-based exploratory data analysis
        mining with drilling, dicing, pivoting, etc.
    On-line selection of data mining functions
        integration and swapping of multiple mining functions, algorithms, and tasks.

Architecture of OLAM


An OLAM Architecture

[Figure: four-layer OLAM architecture, top to bottom]

Layer 4, User Interface: User GUI API (mining query in, mining result out)
Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine; Data Cube API
Layer 2, MDDB: MDDB, Meta Data; Database API (Filtering & Integration, Filtering)
Layer 1, Data Repository: Databases, Data Warehouse (Data cleaning, Data integration)


Summary

Data warehouse
    A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process

A multi-dimensional model of a data warehouse
    Multidimensional data model
    Star schema, snowflake schema, fact constellations
    A data cube consists of identifier dimensions & measure dimension

Concept hierarchies
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP, …
