7 October 2003 1
CSCI6405 Fall 2003
Data Mining and Data Warehousing
Instructor: Qigang Gao, Office: CS219, Tel: 494-3356, Email: [email protected]
Teaching Assistant: Christopher Jordan, Email: [email protected]
Office Hours: TR, 1:30 - 3:00 PM
Lectures Outline
Part I: Overview of DM and DW
  1. Introduction (Ch1) Ass1 Due: Sep 23 Tue
  2. Data preprocessing (Ch3)
Part II: DW and OLAP
  3. Data warehousing and OLAP (Ch2) Ass2: Sep 23 - Oct 14
Part III: Data Mining Methods/Algorithms
  4. Data mining primitives (Ch4)
  5. Classification data mining (Ch7) Ass3: Oct 7 - Oct 21
  6. Association data mining (Ch6) Ass4: Oct 21 - Nov 5
  7. Characterization data mining (Ch5)
  8. Clustering data mining (Ch8)
Part IV: Mining Complex Types of Data
  9. Mining the Web (Ch9)
  10. Mining spatial data (Ch9)
Project Presentations
Project Due: Dec 8
Reservation of the LCD Lab:
  Wed: 8:30 am - 2:00 pm
  Sat: 12:00 pm - 6:00 pm
  Sun: 12:00 pm - 6:00 pm
2. DATA WAREHOUSING AND OLAP (Ch2)
  Objectives of DW/OLAP
  What is a DW?
  Multidimensional Data Model
  DW Schemas
  Aggregations
  OLAP Operations
  DW Architecture
  From data warehousing to data mining
How to define a DW schema: a data mining query language (DMQL)
Cube Definition (Fact Table)
  define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition (Dimension Table)
  define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
  First time as "cube definition"
  define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, state_or_province, country
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
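As a rough relational analogue of the DMQL star schema definition, the same structure could be sketched with Python's built-in sqlite3 module. This is only an illustration: the `_dim` and `_fact` table-name suffixes, the sample rows, and the choice to store one `sales_in_dollars` value per fact row are assumptions, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                           month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                           type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the raw value to aggregate.
CREATE TABLE sales_fact   (time_key INTEGER REFERENCES time_dim(time_key),
                           item_key INTEGER REFERENCES item_dim(item_key),
                           branch_key INTEGER REFERENCES branch_dim(branch_key),
                           location_key INTEGER REFERENCES location_dim(location_key),
                           sales_in_dollars REAL);
""")

# Invented sample data, for illustration only.
conn.executemany("INSERT INTO time_dim VALUES (?,?,?,?,?,?)",
                 [(1, 15, "Wed", 1, "Q1", 2003), (2, 15, "Tue", 4, "Q2", 2003)])
conn.execute("INSERT INTO item_dim VALUES (1, 'laptop', 'Acme', 'computer', 'wholesale')")
conn.execute("INSERT INTO branch_dim VALUES (1, 'B1', 'retail')")
conn.executemany("INSERT INTO location_dim VALUES (?,?,?,?,?)",
                 [(1, "1 Main St", "Vancouver", "BC", "Canada"),
                  (2, "2 Spring Rd", "Halifax", "NS", "Canada")])
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)",
                 [(1, 1, 1, 1, 1000.0), (1, 1, 1, 1, 500.0),
                  (1, 1, 1, 2, 700.0), (2, 1, 1, 1, 300.0)])

# The three measures of the cube, grouped by (quarter, city).
rows = conn.execute("""
    SELECT t.quarter, l.city,
           SUM(s.sales_in_dollars) AS dollars_sold,
           AVG(s.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales_fact AS s
    JOIN time_dim AS t     ON t.time_key = s.time_key
    JOIN location_dim AS l ON l.location_key = s.location_key
    GROUP BY t.quarter, l.city
    ORDER BY t.quarter, l.city
""").fetchall()
```

Each dimension table joins to the fact table through a single key, which is what gives the schema its star shape.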
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_key
  supplier table: supplier_key, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city_key
  city table: city_key, city, state_or_province, country
Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, province_or_state, country
shipper dimension table: shipper_key, shipper_name, location_key, shipper_type
Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
How are hierarchical data materialized in a data warehouse?
Aggregations
- To measure a business event
What do I want to look at? What am I trying to compare?
* define a grouping (i.e., determine a cuboid of the data cube),
* measure the fact about the event (i.e., the cuboid):
  retrieve a pre-calculated value, or invoke an aggregate function
* OLAP query: dimension-value pairs.
  E.g., dimensions: <time="Q1", location="Vancouver", item="Computer">
  value (measured): sales = sum(the data set).
A measure value is computed for a defined cuboid by aggregating the data corresponding to the respective dimension-value pairs defining the given event.
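A minimal sketch of how a measure might be computed for a set of dimension-value pairs, assuming the base data is a plain list of records. The `measure` helper, the field names, and the sample figures are all invented for illustration.

```python
# Hypothetical base-level sales records (invented for illustration).
sales = [
    {"time": "Q1", "location": "Vancouver", "item": "Computer", "dollars": 1200.0},
    {"time": "Q1", "location": "Vancouver", "item": "Computer", "dollars": 800.0},
    {"time": "Q1", "location": "Toronto",   "item": "Computer", "dollars": 950.0},
    {"time": "Q2", "location": "Vancouver", "item": "Computer", "dollars": 400.0},
    {"time": "Q1", "location": "Vancouver", "item": "Phone",    "dollars": 300.0},
]

def measure(records, aggregate, value_field, **dimension_values):
    """Apply `aggregate` to `value_field` over the records that match
    every given dimension-value pair."""
    selected = [r[value_field] for r in records
                if all(r[dim] == val for dim, val in dimension_values.items())]
    return aggregate(selected)

# <time="Q1", location="Vancouver", item="Computer"> with sales = sum(...)
q1_vancouver_computers = measure(sales, sum, "dollars",
                                 time="Q1", location="Vancouver", item="Computer")

# Fewer dimension-value pairs select a coarser cuboid cell.
q1_all = measure(sales, sum, "dollars", time="Q1")
```

A real warehouse would usually look up a pre-calculated value for common groupings rather than scan the base data on every query.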
Measures: Three Categories
* Distributive functions: An aggregate function is distributive if, when a set is divided into n subsets and the function is applied to each subset, aggregating the n partial results gives the same result as applying the function to the whole set.
  E.g., count(), sum(), min(), max().
* Algebraic functions: An aggregate function is algebraic if it can be calculated by an algebraic function with M arguments, each of which is obtained by a distributive aggregate function.
  E.g., avg() = sum() / count(), standard_deviation(), ...
* Holistic functions: An aggregate function is holistic if it characterizes one or more elements of a set relative to the other elements of the set and cannot be computed by an algebraic calculation.
  E.g., rank(), median(), ...
Distributive and algebraic aggregate functions are the most frequently used and can be calculated efficiently. In contrast, holistic aggregate functions cannot, in general, be calculated efficiently, and so are generally not used in data warehouses.
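The three categories can be illustrated on a small set split into subsets. This is a sketch using only the Python standard library; the numbers are arbitrary.

```python
from statistics import median

# A set divided into n = 3 subsets (arbitrary illustrative values).
partitions = [[4, 8, 15], [16, 23], [42]]
whole = [x for part in partitions for x in part]

# Distributive: partial results per subset combine into the result for the whole set.
total = sum(sum(part) for part in partitions)      # equals sum(whole)
count = sum(len(part) for part in partitions)      # equals len(whole)
largest = max(max(part) for part in partitions)    # equals max(whole)

# Algebraic: computed from a fixed number of distributive results (here M = 2).
average = total / count                            # equals the average of whole

# Holistic: per-subset medians do not, in general, combine into the true median.
median_of_medians = median([median(part) for part in partitions])
true_median = median(whole)
```

Here the median of the subset medians differs from the median of the whole set, which is why a holistic measure generally needs access to all of the underlying data.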
Pre-aggregation vs. On-line aggregation
Pre-aggregation: all needed calculations are done in advance by a batch process.
On-line aggregation: the aggregation is computed on-line, at query time. The main issue is that the data volume to be aggregated is normally very large, so computing aggregates in real time can be slow.
The manager's rule of thumb:
- The data warehousing system should respond to an average aggregation query in 20 seconds or less.
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube with L levels?
  T = \prod_{i=1}^{n} (L_i + 1), where L_i is the number of levels of dimension i
  E.g., if the cube has 10 dimensions and 4 levels for each dimension:
  T = 5^{10} \approx 9.8 \times 10^6
Materialization of data cube
  Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
Selection of which cuboids to materialize
  Based on size, sharing, access frequency, etc.
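The cuboid count, the product over all n dimensions of (L_i + 1), can be checked with a few lines of Python. The function name `total_cuboids` is just an illustrative choice.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Number of cuboids: each dimension contributes its L_i levels
    plus one extra level for 'all' (the dimension aggregated away)."""
    return prod(levels + 1 for levels in levels_per_dimension)

# 10 dimensions with 4 levels each: 5**10 cuboids.
example = total_cuboids([4] * 10)

# 4 dimensions with a single level each: 2**4 cuboids.
flat_example = total_cuboids([1] * 4)
```

The rapid growth of this count is the usual argument for partial rather than full materialization.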
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
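The lattice for these four dimensions can be enumerated directly: each cuboid corresponds to one subset of the dimensions (ignoring concept hierarchies within a dimension). A small sketch using `itertools.combinations`:

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Level k of the lattice holds every k-dimensional cuboid (group-by).
lattice = {k: list(combinations(dimensions, k))
           for k in range(len(dimensions) + 1)}

for k, cuboids in lattice.items():
    labels = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids: {labels}")

cuboid_count = sum(len(cuboids) for cuboids in lattice.values())
```

The empty combination at level 0 is the apex cuboid ("all"), and the single combination at level 4 is the base cuboid.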
OLAP Operations
Roll up (drill-up): summarize data
  by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
  from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes
Other operations
  drill across: involving (across) more than one fact table, etc.
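The operations above can be sketched on a toy cube stored as a dictionary of cells. Everything here (the cell values, the `city_to_country` hierarchy, and the function names) is invented for illustration; an OLAP server would implement these operations very differently. Drill-down is not shown, since it needs the more detailed data that a roll-up discards.

```python
from collections import defaultdict

DIMS = ("time", "location", "item")
# Toy cube: (quarter, city, item) -> dollars_sold. Values are invented.
cube = {
    ("Q1", "Vancouver", "Computer"): 2000,
    ("Q1", "Toronto", "Computer"): 950,
    ("Q1", "Vancouver", "Phone"): 300,
    ("Q2", "Vancouver", "Computer"): 400,
    ("Q2", "Toronto", "Phone"): 150,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada"}

def roll_up_by_reduction(cells, dim):
    """Roll up by dimension reduction: sum out one dimension."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)

def roll_up_by_hierarchy(cells, dim, parent_of):
    """Roll up by climbing a hierarchy: replace each value by its parent."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:i] + (parent_of[key[i]],) + key[i + 1:]] += value
    return dict(out)

def slice_(cells, dim, value):
    """Slice: select a single value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice(cells, **allowed):
    """Dice: select sets of values on two or more dimensions."""
    return {k: v for k, v in cells.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

def pivot(cells, order):
    """Pivot: reorder the axes without changing any cell value."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cells.items()}

by_time_location = roll_up_by_reduction(cube, "item")
by_country = roll_up_by_hierarchy(cube, "location", city_to_country)
q1_slice = slice_(cube, "time", "Q1")
q1_computers = dice(cube, time={"Q1"}, item={"Computer"})
rotated = pivot(cube, ("item", "time", "location"))
```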
A Star-Net Query Model
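(Figure: radial lines from a central point, one per dimension; the labels along each line are its abstraction levels.)
Shipping Method: AIR-EXPRESS, TRUCK
Customer Orders: ORDER, CONTRACTS
Customer
Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM
Organization: SALES PERSON, DISTRICT, DIVISION
Promotion
Location: CITY, COUNTRY, REGION
Time: DAILY, QTRLY, ANNUALLY
Each circle is called a footprint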
Example of data warehousing using MS SQL server 2000
Drill down to see product categories.
Drill down to see product “Clams” sales information.
DW Development Procedure
1. Choose a business process to model, understand the complexity of data, determine a data schema to use, etc.
2. Decide the subject(s) and choose the measures that will populate each fact table record.
3. Choose the fact table: the grain and measures of the subject. The grain is the fundamental, atomic level of data to be represented in the fact table, such as daily or weekly sales.
4. Choose the dimensions that will apply to each fact table record.
Data Warehouse Development: An Incremental Approach
(Figure: flow diagram with the following labels.)
Define a high-level corporate data model
Data Marts (with model refinement)
Enterprise Data Warehouse (with model refinement)
Distributed Data Marts
Multi-Tier Data Warehouse
Data Warehouse Architecture
The architecture of data, by abstraction level (most abstract first):
  Business rules
  Metadata
  Schema
  Summary data
  Operational data
The abstraction hierarchy of data and its description helps users navigate around a data warehouse. As data gets more abstract, it generally gets less voluminous.
The architecture of data (cont.)
- Operational data: who, what, where, and when
- Summary data: summaries by who, what, where, and when
- Schema: physical layout of the data, tables, fields, indexes, types
- Metadata: logical model and mappings to physical layout and sources (by defining the data in business terms)
- Business rules: what's been learned from the data
Multi-tier architecture:
• Client site: The end user can query and visualize data on the local computer or connect to a display server that has access to the DW.
• Middle server: Logically, OLAP engines present the users with multidimensional data from DWs or data marts. However, the physical architecture implementation issues must be considered for OLAP engines.
• DW server: The data warehouse is generated from relational or operational databases through gateways for extraction and integration of multiple data sources: ODBC (Open Database Connectivity), OLE DB (Object Linking and Embedding, Database), JDBC (Java Database Connectivity), etc.
Multi-Tiered Architecture
(Figure: four tiers, left to right.)
Data Sources: Operational DBs, other sources
  Extract, Transform, Load, Refresh; Monitor & Integrator; Metadata
Data Storage: Data Warehouse, Data Marts
OLAP Server: OLAP Engine (serves the front-end tools)
Front-End Tools: Analysis, Query, Reports, Data mining
Data Warehouse Back-End Tools and Utilities
Data extraction: get data from multiple, heterogeneous, and external sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh: propagate the updates from the data sources to the warehouse
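As a toy illustration of these back-end steps, the sketch below runs extract, clean, transform, load, and refresh over two made-up sources. The source formats, field names, and the choice of a dictionary as the "warehouse" are all assumptions made for the example; real back-end tools are far more elaborate.

```python
from collections import defaultdict
from datetime import datetime

# Two hypothetical sources with different (invented) formats.
source_a = [{"date": "2003-10-07", "city": "Halifax ", "amount": "120.50"},
            {"date": "2003-10-07", "city": "halifax", "amount": "30"},
            {"date": "2003-10-07", "city": "Halifax", "amount": "n/a"}]
source_b = [("07/10/2003", "TORONTO", 80.0)]   # day/month/year

def extract():
    """Gather rows from heterogeneous sources into one record shape."""
    rows = [dict(r) for r in source_a]
    rows += [{"date": d, "city": c, "amount": a} for d, c, a in source_b]
    return rows

def clean(rows):
    """Normalize city names; drop rows whose amount cannot be rectified."""
    cleaned = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue
        cleaned.append({**r, "city": r["city"].strip().title(), "amount": amount})
    return cleaned

def transform(rows):
    """Convert source date formats to the warehouse format (ISO dates)."""
    out = []
    for r in rows:
        fmt = "%d/%m/%Y" if "/" in r["date"] else "%Y-%m-%d"
        out.append({**r, "date": datetime.strptime(r["date"], fmt).date().isoformat()})
    return out

def load(rows, warehouse):
    """Sort the rows and summarize amounts per (date, city)."""
    for r in sorted(rows, key=lambda r: (r["date"], r["city"])):
        warehouse[(r["date"], r["city"])] += r["amount"]
    return warehouse

warehouse = load(transform(clean(extract())), defaultdict(float))

# Refresh: propagate a later update from a source into the same warehouse.
update = [{"date": "08/10/2003", "city": "Toronto", "amount": 20.0}]
load(transform(clean(update)), warehouse)
```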
OLAP Server Architectures
Multidimensional OLAP (MOLAP)
  Implemented as a large multidimensional array
  Fast indexing to pre-computed summarized data (with built-in indexing)
  Not proven to scale effectively to large, high-dimensionality data sets
Relational OLAP (ROLAP)
  Implemented as a collection of relational tables
  Can be processed and queried with traditional RDBMS technology (i.e., indexes, joins, etc.)
  Greater scalability
  No "built-in" indexing
Hybrid OLAP (HOLAP)
  User flexibility, e.g., low level: relational, high level: array
  Example: MS SQL Server 2000
E.g., the same data stored in a multidimensional array for MOLAP, and in multiple tables for ROLAP (see the distributed sheet).
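The contrast between the two storage styles can be sketched with the same small data set held both ways. The data and helper names are invented; this only illustrates positional array lookup versus searching relational rows, not how any particular OLAP server stores data.

```python
quarters = ["Q1", "Q2"]
cities = ["Vancouver", "Toronto"]

# ROLAP style: a relational fact table, one row per non-empty cell.
fact_rows = [("Q1", "Vancouver", 2300), ("Q1", "Toronto", 950), ("Q2", "Vancouver", 400)]

# MOLAP style: a dense array addressed by position; empty cells are stored too.
array = [[0] * len(cities) for _ in quarters]
for quarter, city, value in fact_rows:
    array[quarters.index(quarter)][cities.index(city)] = value

def molap_lookup(quarter, city):
    """Direct positional access: the array layout acts as the built-in index."""
    return array[quarters.index(quarter)][cities.index(city)]

def rolap_lookup(quarter, city):
    """Scan (or, in a real RDBMS, use an index on) the fact rows."""
    return sum(v for q, c, v in fact_rows if q == quarter and c == city)

stored_molap_cells = len(quarters) * len(cities)
stored_rolap_rows = len(fact_rows)
```

With only two small dimensions the array wastes a single cell, but the number of cells grows multiplicatively with every added dimension, which is one way to see the scalability concern noted for MOLAP.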
From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
Why online analytical mining?
  High quality of data in data warehouses
    DW contains integrated, consistent, cleaned data
  Available information processing structure surrounding data warehouses
    ODBC (Open Database Connectivity), Web accessing, service facilities, reporting and OLAP tools
  OLAP-based exploratory data analysis
    mining with drilling, dicing, pivoting, etc.
  On-line selection of data mining functions
    integration and swapping of multiple mining functions, algorithms, and tasks
Architecture of OLAM
An OLAM Architecture
(Figure: four layers, top to bottom.)
Layer 4, User Interface: User GUI API (mining query in, mining result out)
Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine; Data Cube API
Layer 2, MDDB: MDDB, Meta Data; Database API; Filtering & Integration
Layer 1, Data Repository: Databases, Data Warehouse; Data cleaning, Data integration, Filtering
Summary
Data warehouse
  A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process
A multi-dimensional model of a data warehouse
  Multidimensional data model
  Star schema, snowflake schema, fact constellations
  A data cube consists of identifier dimensions & measure dimension
Concept hierarchies
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
…