CS1004-NOL



Chapter 1. Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Major issues in data mining


Why Data Mining?

The explosive growth of data: from terabytes to petabytes

Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society

Major sources of abundant data
  Business: Web, e-commerce, transactions, stocks, …
  Science: remote sensing, bioinformatics, scientific simulation, …
  Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”—data mining: automated analysis of massive data sets


What Is Data Mining?

Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

Data mining: a misnomer?

Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: is everything “data mining”? Not simple search and query processing, and not (deductive) expert systems


Knowledge Discovery (KDD) Process

Data mining—the core of the knowledge discovery process

[Figure: KDD process — databases → data cleaning and data integration → data warehouse → selection of task-relevant data → data mining → pattern evaluation.]


KDD Process: Several Key Steps

Learning the application domain: relevant prior knowledge and goals of the application

Creating a target data set: data selection

Data cleaning and preprocessing (may take 60% of the effort!)

Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation

Choosing the data mining function(s): summarization, classification, regression, association, clustering

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge


Data Mining and Business Intelligence

[Figure: layers with increasing potential to support business decisions, and the people involved —
  Decision making (end user)
  Data presentation, visualization techniques (business analyst)
  Data mining, information discovery (data analyst)
  Data exploration: statistical summary, querying, and reporting
  Data preprocessing/integration, data warehouses (DBA)
  Data sources: paper, files, Web documents, scientific experiments, database systems]


Data Mining: Confluence of Multiple Disciplines

[Figure: data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.]


Why Not Traditional Data Analysis?

Tremendous amount of data: algorithms must be highly scalable to handle terabytes of data

High dimensionality of data: micro-arrays may have tens of thousands of dimensions

High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structured data, graphs, social networks and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text and Web data
  Software programs, scientific simulations

New and sophisticated applications


Data Mining: Classification Schemes

General functionality

Descriptive data mining

Predictive data mining

Different views lead to different classifications

Data view: Kinds of data to be mined

Knowledge view: Kinds of knowledge to be

discovered

Method view: Kinds of techniques utilized

Application view: Kinds of applications adapted


Data Mining: On What Kinds of Data?

Database-oriented data sets and applications: relational databases, data warehouses, transactional databases

Advanced data sets and advanced applications
  Data streams and sensor data
  Time-series data, temporal data, sequence data (incl. bio-sequences)
  Structured data, graphs, social networks and multi-linked data
  Object-relational databases
  Heterogeneous databases and legacy databases
  Spatial data and spatiotemporal data
  Multimedia databases
  Text databases
  The World-Wide Web


Data Mining Functionalities

Multidimensional concept description: characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Frequent patterns, association, correlation vs. causality
  Diaper → Beer [0.5%, 75%] (correlation or causality?)

Classification and prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction, e.g., classify countries based on (climate), or classify cars based on (gas mileage)
  Predict some unknown or missing numerical values


Data Mining Functionalities (2)

Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximize intra-class similarity and minimize inter-class similarity

Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection and rare-events analysis

Trend and evolution analysis
  Trend and deviation: e.g., regression analysis
  Sequential pattern mining: e.g., digital camera → large SD memory card
  Periodicity analysis
  Similarity-based analysis

Other pattern-directed or statistical analyses


Major Issues in Data Mining

Mining methodology
  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed and incremental mining methods
  Integration of the discovered knowledge with existing knowledge: knowledge fusion

User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction

Applications and social impacts
  Domain-specific data mining and invisible data mining
  Protection of data security, integrity, and privacy


Are All the “Discovered” Patterns Interesting?

Data mining may generate thousands of patterns: not all of them are interesting

Suggested approach: human-centered, query-based, focused mining

Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.


Find All and Only Interesting Patterns?

Find all the interesting patterns: completeness
  Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  Heuristic vs. exhaustive search
  Association vs. classification vs. clustering

Search for only the interesting patterns: an optimization problem
  Can a data mining system find only the interesting patterns?
  Approaches:
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns—mining query optimization


Why Data Mining Query Language?

Automated vs. query-driven?
  Finding all the patterns autonomously in a database is unrealistic: the patterns could be too many and mostly uninteresting

Data mining should be an interactive process: the user directs what is to be mined

Users must be provided with a set of primitives to communicate with the data mining system

Incorporating these primitives in a data mining query language gives
  More flexible user interaction
  A foundation for the design of graphical user interfaces
  Standardization of the data mining industry and practice


Primitives that Define a Data Mining Task

Task-relevant data
  Database or data warehouse name
  Database tables or data warehouse cubes
  Conditions for data selection
  Relevant attributes or dimensions
  Data grouping criteria

Type of knowledge to be mined: characterization, discrimination, association, classification, prediction, clustering, outlier analysis, other data mining tasks

Background knowledge

Pattern interestingness measures

Visualization/presentation of discovered patterns


Primitive 3: Background Knowledge

A typical kind of background knowledge: concept hierarchies

Schema hierarchy
  E.g., street < city < province_or_state < country

Set-grouping hierarchy
  E.g., {20–39} = young, {40–59} = middle_aged

Operation-derived hierarchy
  E.g., email address [email protected] gives login-name < department < university < country

Rule-based hierarchy
  E.g., low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50

Primitive 4: Pattern Interestingness Measure

Simplicity: e.g., (association) rule length, (decision) tree size

Certainty: e.g., confidence, P(A|B) = #(A and B) / #(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.

Utility: potential usefulness, e.g., support (association), noise threshold (description)

Novelty: not previously known, surprising (used to remove redundant rules, e.g., the Illinois vs. Champaign rule implication support ratio)

Primitive 5: Presentation of Discovered Patterns

Different backgrounds/usages may require different forms of representation: e.g., rules, tables, crosstabs, pie/bar charts, etc.

Concept hierarchies are also important: discovered knowledge might be more understandable when represented at a high level of abstraction

Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data

Different kinds of knowledge require different representations: association, classification, clustering, etc.

DMQL—A Data Mining Query Language

Motivation
  A DMQL can provide the ability to support ad-hoc and interactive data mining
  By providing a standardized language like SQL, the hope is to achieve an effect similar to the one SQL had on relational databases
  A foundation for system development and evolution
  Facilitates information exchange, technology transfer, commercialization and wide acceptance

Design
  DMQL is designed with the primitives described earlier


An Example Query in DMQL


Other Data Mining Languages & Standardization Efforts

Association rule language specifications
  MSQL (Imielinski & Virmani ’99)
  MineRule (Meo, Psaila and Ceri ’96)
  Query flocks based on Datalog syntax (Tsur et al. ’98)

OLE DB for DM (Microsoft 2000) and, more recently, DMX (Microsoft SQL Server 2005)
  Based on OLE, OLE DB, OLE DB for OLAP, C#
  Integrating DBMS, data warehouse and data mining

DMML (Data Mining Mark-up Language) by DMG (www.dmg.org)
  Providing a platform and process structure for effective data mining
  Emphasizing deploying data mining technology to solve business problems


Integration of Data Mining and Data Warehousing

Coupling of data mining systems, DBMSs, and data warehouse systems: no coupling, loose coupling, semi-tight coupling, tight coupling

On-line analytical mining: integration of mining and OLAP technologies

Interactive mining of multi-level knowledge: the necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.

Integration of multiple mining functions, e.g., characterized classification, or first clustering and then association


Coupling Data Mining with DB/DW Systems

No coupling: flat-file processing, not recommended

Loose coupling: fetching data from a DB/DW

Semi-tight coupling—enhanced DM performance: efficient implementation of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions

Tight coupling—a uniform information processing environment: DM is smoothly integrated into a DB/DW system; a mining query is optimized based on mining query analysis, indexing, query processing methods, etc.


Architecture: Typical Data Mining System

[Figure: typical data mining system architecture — data sources (database, data warehouse, World-Wide Web, other information repositories) feed, via data cleaning, integration, and selection, a database or data warehouse server; above it sit the data mining engine and a pattern evaluation module, both consulting a knowledge base, topped by a graphical user interface.]


What Is a Data Warehouse?

Defined in many different ways, but not rigorously
  A decision support database that is maintained separately from the organization’s operational database
  Supports information processing by providing a solid platform of consolidated, historical data for analysis

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon

Data warehousing: the process of constructing and using data warehouses

Data Warehouse—Subject-Oriented

Organized around major subjects, such as customer, product, sales

Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process


Data Warehouse—Integrated

Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc., among different data sources (e.g., hotel price: currency, tax, whether breakfast is covered, etc.)
  When data is moved to the warehouse, it is converted

Data Warehouse—Time-Variant

The time horizon for the data warehouse is significantly longer than that of operational systems
  Operational database: current-value data
  Data warehouse data: provide information from a historical perspective (e.g., the past 5–10 years)

Every key structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a “time element”

Data Warehouse—Nonvolatile

A physically separate store of data transformed from the operational environment

Operational update of data does not occur in the data warehouse environment
  Does not require transaction processing, recovery, or concurrency control mechanisms
  Requires only two operations in data accessing: initial loading of data and access of data

Data Warehouse vs. Heterogeneous DBMS

Traditional heterogeneous DB integration: a query-driven approach
  Build wrappers/mediators on top of the heterogeneous databases
  When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
  Complex information filtering; queries compete for resources

Data warehouse: update-driven, high performance
  Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS

OLTP (on-line transaction processing)
  The major task of traditional relational DBMSs
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)
  The major task of a data warehouse system
  Data analysis and decision making

Distinct features (OLTP vs. OLAP):
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries


OLTP vs. OLAP

                    OLTP                            OLAP
users               clerk, IT professional          knowledge worker
function            day-to-day operations           decision support
DB design           application-oriented            subject-oriented
data                current, up-to-date, detailed,  historical, summarized,
                    flat relational, isolated       multidimensional, integrated,
                                                    consolidated
usage               repetitive                      ad hoc
access              read/write, index/hash          lots of scans
                    on primary key
unit of work        short, simple transaction       complex query
# records accessed  tens                            millions
# users             thousands                       hundreds
DB size             100 MB–GB                       100 GB–TB
metric              transaction throughput          query throughput, response time

Why a Separate Data Warehouse?

High performance for both systems
  DBMS—tuned for OLTP: access methods, indexing, concurrency control, recovery
  Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

Different functions and different data
  Missing data: decision support requires historical data which operational DBs do not typically maintain
  Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Note: there are more and more systems which perform OLAP analysis directly on relational databases

Chapter 3: Data Generalization, Data Warehousing, and On-line Analytical Processing

Data generalization and concept description

Data warehouse: Basic concept

Data warehouse modeling: Data cube and OLAP

Data warehouse architecture

Data warehouse implementation

From data warehousing to data mining

Cube: A Lattice of Cuboids

0-D (apex) cuboid:  all
1-D cuboids:        (time), (item), (location), (supplier)
2-D cuboids:        (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids:        (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid:  (time, item, location, supplier)
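Below is a minimal Python sketch (not from the slides) that enumerates this lattice as the power set of the cube's dimensions; the dimension names are the ones used above.

    from itertools import combinations

    dimensions = ["time", "item", "location", "supplier"]

    for k in range(len(dimensions) + 1):
        for cuboid in combinations(dimensions, k):
            label = ", ".join(cuboid) if cuboid else "all (apex)"
            print(f"{k}-D cuboid: {label}")

    # A 4-D cube without hierarchies therefore has 2**4 = 16 cuboids.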

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions and measures
  Star schema: a fact table in the middle connected to a set of dimension tables
  Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema or fact constellation


Example of Star Schema

[Figure: star schema —
  Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  time dimension: time_key, day, day_of_the_week, month, quarter, year
  item dimension: item_key, item_name, brand, type, supplier_type
  branch dimension: branch_key, branch_name, branch_type
  location dimension: location_key, street, city, state_or_province, country]

Example of Snowflake Schema

[Figure: snowflake schema —
  Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  time dimension: time_key, day, day_of_the_week, month, quarter, year
  item dimension: item_key, item_name, brand, type, supplier_key → supplier (supplier_key, supplier_type)
  branch dimension: branch_key, branch_name, branch_type
  location dimension: location_key, street, city_key → city (city_key, city, state_or_province, country)]

Example of Fact Constellation

[Figure: fact constellation —
  Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
  Shared dimensions: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), location (location_key, street, city, province_or_state, country)
  Other dimensions: branch (branch_key, branch_name, branch_type), shipper (shipper_key, shipper_name, location_key, shipper_type)]

Cube Definition Syntax (BNF) in DMQL

Cube definition (fact table):
    define cube <cube_name> [<dimension_list>]: <measure_list>

Dimension definition (dimension table):
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Special case (shared dimension tables), defined the first time in another cube:
    define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining a Fact Constellation in DMQL

define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Measures of a Data Cube: Three Categories

Distributive: the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning
  E.g., count(), sum(), min(), max()

Algebraic: can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  E.g., avg(), min_N(), standard_deviation()

Holistic: there is no constant bound on the storage size needed to describe a subaggregate
  E.g., median(), mode(), rank()

(A small sketch of why avg() is algebraic follows.)
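The following Python sketch (toy data, an illustration only) shows why avg() is algebraic: it can be derived from the two distributive measures sum() and count() computed per partition, without rescanning the raw data; median() has no such bounded summary, which is what makes it holistic.

    partitions = [[4, 8, 9], [15, 21, 21, 24], [25, 26, 28, 29, 34]]

    partial = [(sum(p), len(p)) for p in partitions]   # distributive pieces
    total_sum = sum(s for s, _ in partial)
    total_count = sum(c for _, c in partial)
    print(total_sum / total_count)                     # algebraic avg()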

Concept Hierarchy: Dimension (location)

[Figure: concept hierarchy for the location dimension — levels all < region < country < city < office; e.g., all → {Europe, North_America}; Europe → {Germany, Spain}, North_America → {Canada, Mexico}; Germany → Frankfurt, Canada → {Vancouver, Toronto}; Vancouver → {M. Wind, L. Chan}.]


Multidimensional Data

Sales volume as a function of product, month, and region

Dimensions: product, location, time

Hierarchical summarization paths:
  Industry > Category > Product
  Region > Country > City > Office
  Year > Quarter > Month or Week > Day

[Figure: a 3-D sales cube with axes product, region, and month.]


A Sample Data Cube

[Figure: a sample 3-D data cube with dimensions date (1Qtr–4Qtr), product (TV, VCR, PC) and country (U.S.A., Canada, Mexico), plus sum cells along each dimension; e.g., one aggregate cell holds the total annual sales of TVs in the U.S.A.]

Cuboids Corresponding to the Cube

0-D (apex) cuboid:  all
1-D cuboids:        (product), (date), (country)
2-D cuboids:        (product, date), (product, country), (date, country)
3-D (base) cuboid:  (product, date, country)


Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction

Drill down (roll down): the reverse of roll-up; from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions

Slice and dice: project and select

Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes

Other operations
  Drill across: involving (across) more than one fact table
  Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)


Fig. 3.10: Typical OLAP Operations

Design of a Data Warehouse: A Business Analysis Framework

Four views regarding the design of a data warehouse
  Top-down view: allows selection of the relevant information necessary for the data warehouse
  Data source view: exposes the information being captured, stored, and managed by operational systems
  Data warehouse view: consists of fact tables and dimension tables
  Business query view: sees the perspectives of data in the warehouse from the end-user’s point of view

Data Warehouse Design Process

Top-down, bottom-up approaches, or a combination of both
  Top-down: starts with overall design and planning (mature)
  Bottom-up: starts with experiments and prototypes (rapid)

From a software engineering point of view
  Waterfall: structured and systematic analysis at each step before proceeding to the next
  Spiral: rapid generation of increasingly functional systems, with short turnaround time

Typical data warehouse design process
  Choose a business process to model, e.g., orders, invoices, etc.
  Choose the grain (atomic level of data) of the business process
  Choose the dimensions that will apply to each fact table record
  Choose the measures that will populate each fact table record


Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered architecture — data sources (operational DBs and other sources) are extracted, transformed, loaded and refreshed by a monitor & integrator with a metadata repository into the data-storage tier (the data warehouse and data marts); OLAP servers/engines serve this storage to front-end tools for analysis, query/reports, and data mining.]

Three Data Warehouse Models

Enterprise warehouse: collects all of the information about subjects spanning the entire organization

Data mart: a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
  Independent vs. dependent (fed directly from the warehouse) data marts

Virtual warehouse: a set of views over operational databases; only some of the possible summary views may be materialized

Data Warehouse Development: A Recommended Approach

[Figure: recommended approach — define a high-level corporate data model, build individual data marts, integrate them into distributed data marts and a multi-tier data warehouse, converging on an enterprise data warehouse through repeated model refinement.]

Data Warehouse Back-End Tools and Utilities

Data extraction: get data from multiple, heterogeneous, and external sources

Data cleaning: detect errors in the data and rectify them when possible

Data transformation: convert data from legacy or host format to warehouse format

Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

Refresh: propagate the updates from the data sources to the warehouse


Metadata Repository

Metadata is the data defining warehouse objects. It stores:
  A description of the structure of the data warehouse: schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents
  Operational metadata: data lineage (history of migrated data and its transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
  The algorithms used for summarization
  The mapping from the operational environment to the data warehouse
  Data related to system performance: warehouse schema, view and derived data definitions
  Business data: business terms and definitions, ownership of data, charging policies


OLAP Server Architectures

Relational OLAP (ROLAP)
  Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  Greater scalability

Multidimensional OLAP (MOLAP)
  Sparse array-based multidimensional storage engine
  Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
  Flexibility, e.g., low level: relational, high level: array

Specialized SQL servers (e.g., Redbricks)
  Specialized support for SQL queries over star/snowflake schemas

Efficient Data Cube Computation

A data cube can be viewed as a lattice of cuboids
  The bottom-most cuboid is the base cuboid
  The top-most cuboid (apex) contains only one cell
  How many cuboids are there in an n-dimensional cube where dimension i has L_i hierarchy levels?

    T = ∏_{i=1}^{n} (L_i + 1)

Materialization of the data cube
  Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
  Selection of which cuboids to materialize: based on size, sharing, access frequency, etc.
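A minimal Python sketch of the cuboid-count formula; the per-dimension level counts below are illustrative assumptions, not values from the slides.

    from math import prod

    levels = {"time": 4, "item": 3, "location": 4, "supplier": 1}   # L_i per dimension
    T = prod(L + 1 for L in levels.values())
    print(T)   # 5 * 4 * 5 * 2 = 200 cuboids for this 4-D cube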


Data Warehouse Implementation

Efficient Cube Computation

Efficient Indexing

Efficient Processing of OLAP Queries


Cube Operation

Cube definition and computation in DMQL:
    define cube sales [item, city, year]: sum(sales_in_dollars)
    compute cube sales

Transform it into an SQL-like language (with a new operator CUBE BY, introduced by Gray et al. ’96):
    SELECT item, city, year, SUM(amount)
    FROM SALES
    CUBE BY item, city, year

This requires computing the following group-bys: (date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), ()

[Figure: lattice of group-bys for the sales cube — (city, item, year); (city, item), (city, year), (item, year); (city), (item), (year); ().]
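As a rough illustration of what CUBE BY computes, the Python sketch below (pandas is an assumption, and the tiny sales table is made up) runs one GROUP BY per subset of the three dimensions, from (item, city, year) down to the empty grouping ().

    from itertools import combinations
    import pandas as pd

    sales = pd.DataFrame({
        "item": ["TV", "TV", "PC"],
        "city": ["Vancouver", "Toronto", "Toronto"],
        "year": [2003, 2004, 2004],
        "amount": [400, 250, 900],
    })

    dims = ["item", "city", "year"]
    for k in range(len(dims), -1, -1):
        for group in combinations(dims, k):
            if group:
                agg = sales.groupby(list(group), as_index=False)["amount"].sum()
            else:
                agg = pd.DataFrame({"amount": [sales["amount"].sum()]})
            print(tuple(group))
            print(agg, "\n")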

Multi-Way Array Aggregation

Array-based “bottom-up” algorithm
Uses multi-dimensional chunks
No direct tuple comparisons
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for computing ancestor cuboids
Cannot do Apriori pruning: no iceberg optimization

[Figure: cuboid lattice for dimensions A, B, C — all; A, B, C; AB, AC, BC; ABC.]

Multi-Way Array Aggregation for Cube Computation (MOLAP)

Partition arrays into chunks (a small subcube that fits in memory)

Compressed sparse array addressing: (chunk_id, offset)

Compute aggregates in “multiway” by visiting cube cells in an order that minimizes the number of times each cell must be visited, reducing memory access and storage cost

What is the best traversing order for multi-way aggregation?

[Figure: a 3-D array with dimensions A (a0–a3), B (b0–b3), C (c0–c3), partitioned into 64 chunks numbered 1–64.]

Multi-Way Array Aggregation for Cube Computation (Cont.)

Method: the planes should be sorted and computed according to their size, in ascending order
  Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane

Limitation of the method: it computes well only for a small number of dimensions
  If there are a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored

Indexing OLAP Data: Bitmap Index

Index on a particular column
Each value in the column has a bit vector: bit operations are fast
The length of the bit vector: the number of records in the base table
The i-th bit is set if the i-th row of the base table has that value for the indexed column
Not suitable for high-cardinality domains

Base table                 Index on Region                 Index on Type
Cust  Region   Type        RecID  Asia  Europe  America    RecID  Retail  Dealer
C1    Asia     Retail      1      1     0       0          1      1       0
C2    Europe   Dealer      2      0     1       0          2      0       1
C3    Asia     Dealer      3      1     0       0          3      0       1
C4    America  Retail      4      0     0       1          4      1       0
C5    Europe   Dealer      5      0     1       0          5      0       1
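A minimal Python sketch (using the toy rows above) of how such bit vectors are built and combined; selections then reduce to fast bit operations.

    rows = [("C1", "Asia", "Retail"), ("C2", "Europe", "Dealer"),
            ("C3", "Asia", "Dealer"), ("C4", "America", "Retail"),
            ("C5", "Europe", "Dealer")]

    def bitmap_index(rows, col):
        # one bit vector per distinct value; bit i is set when row i has that value
        index = {}
        for i, row in enumerate(rows):
            index.setdefault(row[col], [0] * len(rows))[i] = 1
        return index

    region_idx = bitmap_index(rows, 1)
    type_idx = bitmap_index(rows, 2)
    print(region_idx["Asia"])                                               # [1, 0, 1, 0, 0]
    print([a & d for a, d in zip(region_idx["Asia"], type_idx["Dealer"])])  # Asia AND Dealer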

Indexing OLAP Data: Join Indices

Join index: JI(R-id, S-id), where R(R-id, …) joins S(S-id, …)

Traditional indices map values to a list of record ids; a join index materializes the relational join in the JI file and speeds up the relational join

In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table
  E.g., for a fact table Sales and two dimensions city and product, a join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city

Join indices can span multiple dimensions
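A minimal Python sketch (with made-up row IDs) of a join index on city: for each distinct city in the location dimension it keeps the R-IDs of the Sales fact tuples recorded in that city, so the join need not be recomputed at query time.

    sales_city = {101: "Vancouver", 102: "Toronto", 103: "Vancouver", 104: "Frankfurt"}

    join_index = {}
    for rid, city in sales_city.items():
        join_index.setdefault(city, []).append(rid)

    print(join_index["Vancouver"])   # [101, 103]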

Efficient Processing of OLAP Queries

Determine which operations should be performed on the available cuboids
  Transform drill, roll, etc. into the corresponding SQL and/or OLAP operations, e.g., dice = selection + projection

Determine which materialized cuboid(s) should be selected for the OLAP operation
  Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and let four materialized cuboids be available:
    1) {year, item_name, city}
    2) {year, brand, country}
    3) {year, brand, province_or_state}
    4) {item_name, province_or_state} where year = 2004
  Which should be selected to process the query?

Explore indexing structures and compressed vs. dense array structures in MOLAP


Data Warehouse Usage

Three kinds of data warehouse applications

Information processing
  Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

Analytical processing
  Multidimensional analysis of data warehouse data
  Supports basic OLAP operations: slice-dice, drilling, pivoting

Data mining
  Knowledge discovery from hidden patterns
  Supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)

Why online analytical mining?
  High quality of data in data warehouses: a DW contains integrated, consistent, cleaned data
  Available information processing infrastructure surrounding data warehouses: ODBC, OLE DB, Web accessing, service facilities, reporting and OLAP tools
  OLAP-based exploratory data analysis: mining with drilling, dicing, pivoting, etc.
  On-line selection of data mining functions: integration and swapping of multiple mining functions, algorithms, and tasks


An OLAM System Architecture

[Figure: OLAM system architecture — Layer 4: user interface (GUI API; mining query in, mining result out); Layer 3: OLAM engine and OLAP engine over a data cube API; Layer 2: MDDB with metadata; Layer 1: data repository (databases and data warehouse) reached through a database API with filtering and integration, built by data cleaning and data integration.]


UNIT II: Data Preprocessing

Data cleaning

Data integration and transformation

Data reduction

Summary


Major Tasks in Data Preprocessing

Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration: integration of multiple databases, data cubes, or files

Data transformation: normalization and aggregation

Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results

Data discretization: part of data reduction, of particular importance for numerical data


Data Cleaning

No quality data, no quality mining results!
  Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics
  “Data cleaning is the number one problem in data warehousing”—DCI survey
  Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Data cleaning tasks
  Fill in missing values
  Identify outliers and smooth out noisy data
  Correct inconsistent data
  Resolve redundancy caused by data integration

Data in the Real World Is Dirty

Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., occupation = “ ” (missing data)

Noisy: containing noise, errors, or outliers, e.g., Salary = “−10” (an error)

Inconsistent: containing discrepancies in codes or names, e.g.,
  Age = “42”, Birthday = “03/07/1997”
  Was rating “1, 2, 3”, now rating “A, B, C”
  Discrepancy between duplicate records


Why Is Data Dirty?

Incomplete data may come from
  “Not applicable” data values when collected
  Different considerations between the time when the data was collected and when it is analyzed
  Human/hardware/software problems

Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission

Inconsistent data may come from
  Different data sources
  Functional dependency violations (e.g., modifying some linked data)

Duplicate records also need data cleaning

Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility

Broad categories: intrinsic, contextual, representational, and accessibility


Missing Data

Data is not always available: e.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
  Equipment malfunction
  Inconsistency with other recorded data, and hence deletion
  Data not entered due to misunderstanding
  Certain data not being considered important at the time of entry
  Not registering the history or changes of the data

Missing data may need to be inferred

How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (when doing classification)—not effective when the percentage of missing values per attribute varies considerably

Fill in the missing value manually: tedious + infeasible?

Fill it in automatically with
  a global constant, e.g., “unknown” (a new class?!)
  the attribute mean
  the attribute mean for all samples belonging to the same class: smarter
  the most probable value: inference-based, such as a Bayesian formula or a decision tree
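A minimal sketch of the automatic fill-in strategies, assuming pandas and a made-up table with one missing income value.

    import pandas as pd

    df = pd.DataFrame({"class": ["A", "A", "B", "B"],
                       "income": [30000.0, None, 50000.0, 54000.0]})

    filled_constant = df["income"].fillna(-1)                     # a global constant
    filled_mean = df["income"].fillna(df["income"].mean())        # the attribute mean
    filled_class_mean = df.groupby("class")["income"].transform(  # class-wise mean
        lambda s: s.fillna(s.mean()))
    print(filled_class_mean.tolist())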


Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to
  Faulty data collection instruments
  Data entry problems
  Data transmission problems
  Technology limitations
  Inconsistency in naming conventions

Other data problems which require data cleaning
  Duplicate records
  Incomplete data
  Inconsistent data


How to Handle Noisy Data?

Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.

Regression: smooth by fitting the data to regression functions

Clustering: detect and remove outliers

Combined computer and human inspection: detect suspicious values and check them by a human (e.g., deal with possible outliers)


Simple Discretization Methods: Binning

Equal-width (distance) partitioning
  Divides the range into N intervals of equal size: a uniform grid
  If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N
  The most straightforward approach, but outliers may dominate the presentation; skewed data is not handled well

Equal-depth (frequency) partitioning
  Divides the range into N intervals, each containing approximately the same number of samples
  Good data scaling
  Managing categorical attributes can be tricky

Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34

* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
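A minimal Python sketch reproducing the equal-frequency binning and the two smoothing variants on the price data above.

    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    depth = 4
    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
    by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                     for b in bins]
    print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]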


Regression

[Figure: data smoothed by regression — points fitted by the line y = x + 1; a value Y1 at X1 is replaced by its fitted value Y1′.]


Cluster Analysis


Data Cleaning as a Process

Data discrepancy detection
  Use metadata (e.g., domain, range, dependency, distribution)
  Check field overloading
  Check uniqueness rules, consecutive rules and null rules
  Use commercial tools
    Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
    Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)

Data migration and integration
  Data migration tools: allow transformations to be specified
  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

Integration of the two processes: iterative and interactive (e.g., Potter’s Wheels)


Data Integration

Data integration: combines data from multiple sources into a coherent store

Schema integration: e.g., A.cust-id ≡ B.cust-#; integrate metadata from different sources

Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

Detecting and resolving data value conflicts
  For the same real-world entity, attribute values from different sources differ
  Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

Redundant data occur often when integrating multiple databases
  Object identification: the same attribute or object may have different names in different databases
  Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue

Redundant attributes may be detected by correlation analysis

Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson’s product-moment coefficient):

    r_{p,q} = Σ(p − p̄)(q − q̄) / ((n − 1) σ_p σ_q) = (Σ(pq) − n p̄ q̄) / ((n − 1) σ_p σ_q)

where n is the number of tuples, p̄ and q̄ are the respective means of p and q, σ_p and σ_q are the respective standard deviations of p and q, and Σ(pq) is the sum of the pq cross-products.

If r_{p,q} > 0, p and q are positively correlated (p’s values increase as q’s do); the higher the value, the stronger the correlation.

r_{p,q} = 0: independent; r_{p,q} < 0: negatively correlated.
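A minimal NumPy sketch (NumPy and the toy vectors are assumptions) of the coefficient above, checked against numpy.corrcoef.

    import numpy as np

    p = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    q = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

    n = len(p)
    r = ((p - p.mean()) * (q - q.mean())).sum() / (
        (n - 1) * p.std(ddof=1) * q.std(ddof=1))
    print(r, np.corrcoef(p, q)[0, 1])   # the two values agree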

Correlation (Viewed as a Linear Relationship)

Correlation measures the linear relationship between objects

To compute correlation, we standardize the data objects p and q and then take their dot product:

    p′_k = (p_k − mean(p)) / std(p)
    q′_k = (q_k − mean(q)) / std(q)
    correlation(p, q) = p′ · q′


Data Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values

Methods
  Smoothing: remove noise from the data
  Aggregation: summarization, data cube construction
  Generalization: concept hierarchy climbing
  Normalization: scale values to fall within a small, specified range (min-max normalization, z-score normalization, normalization by decimal scaling)
  Attribute/feature construction: new attributes constructed from the given ones

Data Transformation: Normalization

Min-max normalization to [new_min_A, new_max_A]:

    v′ = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A

  Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

Z-score normalization (μ: mean, σ: standard deviation):

    v′ = (v − μ_A) / σ_A

  Ex. Let μ = 54,000 and σ = 16,000. Then (73,600 − 54,000)/16,000 = 1.225

Normalization by decimal scaling:

    v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
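A minimal Python sketch of the three normalizations; it reproduces the worked income example above, while the decimal-scaling values are illustrative assumptions.

    def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
        return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

    def z_score(v, mu, sigma):
        return (v - mu) / sigma

    def decimal_scaling(values):
        j = 0
        while max(abs(v) for v in values) / 10 ** j >= 1:
            j += 1
        return [v / 10 ** j for v in values]

    print(round(min_max(73600, 12000, 98000), 3))   # 0.716
    print(round(z_score(73600, 54000, 16000), 3))   # 1.225
    print(decimal_scaling([-986, 917]))             # [-0.986, 0.917]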

Data Reduction Strategies

Why data reduction?
  A database/data warehouse may store terabytes of data
  Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies
  Dimensionality reduction, e.g., remove unimportant attributes
  Numerosity reduction (some simply call it data reduction): data cube aggregation, data compression, regression, discretization (and concept hierarchy generation)


Dimensionality Reduction

Curse of dimensionality
  When dimensionality increases, data becomes increasingly sparse
  Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  The number of possible combinations of subspaces grows exponentially

Dimensionality reduction
  Avoids the curse of dimensionality
  Helps eliminate irrelevant features and reduce noise
  Reduces the time and space required in data mining
  Allows easier visualization

Dimensionality reduction techniques
  Principal component analysis
  Singular value decomposition
  Supervised and nonlinear techniques (e.g., feature selection)

Dimensionality Reduction: Principal Component Analysis (PCA)

[Figure: 2-D data in (x1, x2) with a principal direction e capturing the largest variance.]

Find a projection that captures the largest amount of variation in the data

Find the eigenvectors of the covariance matrix; these eigenvectors define the new space

Principal Component Analysis (Steps)

Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
  Normalize the input data: each attribute falls within the same range
  Compute k orthonormal (unit) vectors, i.e., the principal components
  Each input data vector is a linear combination of the k principal component vectors
  The principal components are sorted in order of decreasing “significance” or strength
  Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)

Works for numeric data only
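A minimal NumPy sketch of these steps (random toy data; NumPy is an assumption): center the data, take the eigenvectors of the covariance matrix, keep the k strongest components, and reconstruct an approximation.

    import numpy as np

    X = np.random.default_rng(0).normal(size=(100, 5))     # toy numeric data
    Xc = X - X.mean(axis=0)                                 # normalize (center)

    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)                  # orthonormal vectors
    order = np.argsort(eigvals)[::-1]                       # decreasing "significance"

    k = 2
    components = eigvecs[:, order[:k]]
    X_reduced = Xc @ components                             # k-dimensional projection
    X_approx = X_reduced @ components.T + X.mean(axis=0)    # approximate reconstruction
    print(X_reduced.shape, X_approx.shape)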

Feature Subset Selection


Another way to reduce the dimensionality of data

Redundant features
  Duplicate much or all of the information contained in one or more other attributes
  E.g., the purchase price of a product and the amount of sales tax paid

Irrelevant features
  Contain no information that is useful for the data mining task at hand
  E.g., a student’s ID is often irrelevant to the task of predicting the student’s GPA

Heuristic Search in Feature Selection


There are 2^d possible feature combinations of d features.
Typical heuristic feature selection methods:
- Best single features under the feature independence assumption: choose by significance tests
- Best step-wise feature selection: the best single feature is picked first, then the next best feature conditioned on the first, ...
- Step-wise feature elimination: repeatedly eliminate the worst feature
- Best combined feature selection and elimination
- Optimal branch and bound: use feature elimination and backtracking
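A minimal sketch of best step-wise (greedy forward) feature selection, assuming a caller-supplied `score(features)` function such as cross-validated accuracy; everything here is illustrative rather than a fixed API.

```python
def forward_selection(all_features, score, max_features=None):
    """Greedily add the feature that most improves score(selected) at each step."""
    selected, remaining = [], list(all_features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        gains = [(score(selected + [f]), f) for f in remaining]
        new_score, best_f = max(gains)
        if new_score <= best_score:          # stop when no feature improves the score
            break
        best_score = new_score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy usage: a stand-in scoring function that rewards even-indexed features
print(forward_selection(range(6), score=lambda feats: sum(1 for f in feats if f % 2 == 0)))
```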

Feature Creation


Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.
Three general methodologies:
- Feature extraction: domain-specific
- Mapping data to a new space (see: data reduction), e.g., Fourier transformation, wavelet transformation
- Feature construction: combining features, data discretization

Mapping Data to a New Space


[Figure: two sine waves, the two sine waves plus noise, and their frequency-domain representation]
Fourier transform
Wavelet transform
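A small NumPy sketch of mapping a noisy signal to frequency space with the discrete Fourier transform; the two frequencies (5 Hz and 12 Hz) and the noise level are invented for illustration.

```python
import numpy as np

fs = 128                                   # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)                # one second of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)
noisy = signal + 0.3 * np.random.default_rng(0).normal(size=t.size)

spectrum = np.fft.rfft(noisy)              # map the signal to the frequency domain
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
top = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(sorted(top))                         # the two dominant frequencies, ~[5.0, 12.0]
```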

Numerosity (Data) Reduction


Reduce data volume by choosing alternative, smaller forms of data representation.
Parametric methods (e.g., regression):
- Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
- Example: log-linear models—obtain the value at a point in m-D space as the product on appropriate marginal subspaces
Non-parametric methods:
- Do not assume models
- Major families: histograms, clustering, sampling

Regression and Log-Linear Models


Linear regression: data are modeled to fit a straight line; often uses the least-squares method to fit the line.
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.
Log-linear model: approximates discrete multidimensional probability distributions.

Regression Analysis and Log-Linear Models


Linear regression: Y = w X + b
- The two regression coefficients, w and b, specify the line and are estimated from the data at hand
- Apply the least-squares criterion to the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
- Many nonlinear functions can be transformed into the above
Log-linear models:
- The multi-way table of joint probabilities is approximated by a product of lower-order tables
- Probability: p(a, b, c, d) = αab βac χad δbcd
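A minimal sketch of fitting Y = wX + b by least squares in NumPy; the toy data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=50)
Y = 2.5 * X + 4.0 + rng.normal(scale=1.0, size=50)   # a true line plus noise

# Least-squares estimates of the two regression coefficients w and b
A = np.column_stack([X, np.ones_like(X)])
(w, b), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(round(w, 2), round(b, 2))   # close to 2.5 and 4.0

# Storing only (w, b) instead of the raw (X, Y) pairs is the parametric data reduction step
```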

Data Reduction: Wavelet Transformation

 


Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis.
Compressed approximation: store only a small fraction of the strongest wavelet coefficients.
Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space.
Method:
- The length L must be an integer power of 2 (pad with 0's when necessary)
- Each transform has two functions: smoothing and difference
- Apply them to pairs of data, resulting in two sets of data of length L/2
- Apply the two functions recursively until the desired length is reached
Example wavelet families: Haar-2, Daubechies-4.

DWT for Image Compression


[Figure: DWT for image compression: the image is passed recursively through low-pass and high-pass filters at each resolution level]

Data Cube Aggregation


The lowest level of a data cube (base cuboid):
- The aggregated data for an individual entity of interest
- E.g., a customer in a phone-calling data warehouse
Multiple levels of aggregation in data cubes:
- Further reduce the size of the data to deal with
Reference appropriate levels:
- Use the smallest representation which is enough to solve the task
Queries regarding aggregated information should be answered using the data cube, when possible.

Data Compression


String compression:
- There are extensive theories and well-tuned algorithms
- Typically lossless
- But only limited manipulation is possible without expansion
Audio/video compression:
- Typically lossy compression, with progressive refinement
- Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
Time sequences are not audio:
- Typically short and varying slowly with time


[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data]

Data Reduction: Histograms


Divide data into buckets and store the average (or sum) for each bucket.
Partitioning rules:
- Equal-width: equal bucket range
- Equal-frequency (or equal-depth): equal number of values per bucket
- V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)
- MaxDiff: set bucket boundaries between the pairs of adjacent values with the β–1 largest differences

[Figure: example equal-width histogram of values from 10,000 to 90,000, with bucket counts between 0 and 40]
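A small NumPy sketch contrasting equal-width and equal-frequency bucketing; the sample data are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
prices = rng.lognormal(mean=10, sigma=0.4, size=1_000)   # skewed toy data

# Equal-width: same bucket range; counts can be very uneven on skewed data
width_counts, width_edges = np.histogram(prices, bins=5)

# Equal-frequency (equal-depth): bucket edges at quantiles, ~equal counts per bucket
depth_edges = np.quantile(prices, np.linspace(0, 1, 6))
depth_counts, _ = np.histogram(prices, bins=depth_edges)

print(width_counts)   # heavily front-loaded counts
print(depth_counts)   # roughly 200 per bucket
```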

Data Reduction Method: Clustering


Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter).
Can be very effective if the data is clustered, but not if the data is "smeared".
Can use hierarchical clustering and store the result in multi-dimensional index tree structures.
There are many choices of clustering definitions and clustering algorithms.
Cluster analysis will be studied in depth in Chapter 7.

Data Reduction Method: Sampling


Sampling: obtain a small sample s to represent the whole data set N.
Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data.
Key principle: choose a representative subset of the data.
- Simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods, e.g., stratified sampling
Note: sampling may not reduce database I/Os (data is read a page at a time).

Types of Sampling


Simple random sampling:
- There is an equal probability of selecting any particular item
Sampling without replacement:
- Once an object is selected, it is removed from the population
Sampling with replacement:
- A selected object is not removed from the population
Stratified sampling:
- Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
- Used in conjunction with skewed data
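A minimal sketch of the three sampling schemes with NumPy; the `stratum` labels are invented toy data.

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.arange(100)
stratum = np.where(data < 80, "common", "rare")   # skewed: 80% vs 20%

srswor = rng.choice(data, size=10, replace=False)   # simple random sample without replacement
srswr  = rng.choice(data, size=10, replace=True)    # simple random sample with replacement

# Stratified: draw proportionally from each stratum so rare items are still represented
stratified = np.concatenate([
    rng.choice(data[stratum == s], size=max(1, int(0.1 * (stratum == s).sum())), replace=False)
    for s in np.unique(stratum)
])
print(srswor, srswr, stratified, sep="\n")
```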

Sampling: With or Without Replacement

[Figure: from the raw data, SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement) each select a subset of the records]

Sampling: Cluster or Stratified Sampling

[Figure: raw data partitioned into clusters/strata, with a sample drawn from each partition]

Data Reduction: Discretization


Three types of attributes:
- Nominal — values from an unordered set, e.g., color, profession
- Ordinal — values from an ordered set, e.g., military or academic rank
- Continuous — numeric values, e.g., integer or real numbers
Discretization:
- Divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes
- Reduce data size by discretization
- Prepare for further analysis

Discretization and Concept Hierarchy


Discretization:
- Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Discretization can be performed recursively on an attribute
Concept hierarchy formation:
- Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data


Typical methods (all can be applied recursively):
- Binning (covered above): top-down split, unsupervised
- Histogram analysis (covered above): top-down split, unsupervised
- Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
- Entropy-based discretization: supervised, top-down split
- Interval merging by χ2 analysis: unsupervised, bottom-up merge
- Segmentation by natural partitioning: top-down split, unsupervised

Discretization Using Class Labels
Entropy-based approach
[Figure: the same data discretized into 3 categories for both x and y vs. 5 categories for both x and y]

Entropy-Based Discretization


Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information requirement after partitioning is

    I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

    Entropy(S1) = − Σ_{i=1..m} p_i · log2(p_i)

where p_i is the probability of class i in S1.
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is applied recursively to the partitions obtained until some stopping criterion is met.
Such a boundary may reduce data size and improve classification accuracy.
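A minimal sketch of choosing the boundary T that minimizes I(S, T) for one numeric attribute; `values` and `labels` are toy inputs.

```python
import numpy as np

def entropy(labels):
    """Entropy of a class-label array: -sum p_i log2 p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    """Return the boundary T minimizing the weighted entropy I(S, T)."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_i = None, float("inf")
    for cut in range(1, len(values)):
        if values[cut] == values[cut - 1]:
            continue
        left, right = labels[:cut], labels[cut:]
        i_st = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if i_st < best_i:
            best_i, best_t = i_st, (values[cut - 1] + values[cut]) / 2
    return best_t, best_i

values = [1, 2, 3, 10, 11, 12]
labels = ["low", "low", "low", "high", "high", "high"]
print(best_split(values, labels))   # boundary ~6.5, weighted entropy 0.0
```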

Discretization Without Using Class Labels
[Figure: the same data discretized by equal interval width, equal frequency, and K-means]

Interval Merge by χ2 Analysis


Merging-based (bottom-up) vs. splitting-based methods.
Merge: find the best neighboring intervals and merge them to form larger intervals, recursively.
ChiMerge [Kerber AAAI 1992; see also Liu et al. DMKD 2002]:
- Initially, each distinct value of a numerical attribute A is considered to be one interval
- χ2 tests are performed for every pair of adjacent intervals
- Adjacent intervals with the least χ2 values are merged together, since low χ2 values for a pair indicate similar class distributions
- This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)

Segmentation by Natural Partitioning


A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
- If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
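A minimal sketch of the core of the rule: picking the number of equi-width intervals from the count of distinct values at the most significant digit. Rounding of Low and High to the msd (as in the example that follows) is assumed to have been done already.

```python
def three_four_five(low, high):
    """Split [low, high] into 3, 4, or 5 equi-width intervals by the 3-4-5 rule."""
    msd = 10 ** (len(str(int(abs(high - low)))) - 1)   # most significant digit unit of the range
    distinct = round((high - low) / msd)                # distinct values at that digit
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    else:                                               # 1, 5, or 10 distinct values
        n = 5
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(three_four_five(-1000, 2000))   # 3 distinct msd values -> 3 intervals of width 1000
```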

Example of 3-4-5 Rule


Step 1: profit values range from Min = -$351 to Max = $4,700, with Low (5%-tile) = -$159 and High (95%-tile) = $1,838.
Step 2: most significant digit msd = 1,000, so round Low to -$1,000 and High to $2,000.
Step 3: the range (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 equi-width intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000).
Step 4: adjust the boundaries to cover Min and Max, giving the top-level intervals of (-$400 - $5,000), and recurse on each:
- (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
- (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
- ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
- ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)

Concept Hierarchy Generation for Categorical Data


Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts:
- street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping:
- {Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes:
- E.g., only street < city, not the others
Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values:
- E.g., for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation


Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set.
The attribute with the most distinct values is placed at the lowest level of the hierarchy.
There are exceptions, e.g., weekday, month, quarter, year.
Example (distinct-value counts, lowest level last):
- country: 15 distinct values
- province_or_state: 365 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values

Chapter 5: Mining Frequent Patterns, Associations and Correlations


Basic concepts and a road map

Efficient and scalable frequent itemset

mining methods

Mining various kinds of association rules

From association mining to correlation

analysis

Constraint-based association mining

Summary

What Is Frequent Pattern Analysis?


Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.
Motivation: finding inherent regularities in data
- What products were often purchased together?— Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?
Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why Is Frequent Pattern Mining Important?


Discloses an intrinsic and important property of data sets.
Forms the foundation for many essential data mining tasks:
- Association, correlation, and causality analysis
- Sequential, structural (e.g., sub-graph) patterns
- Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
- Classification: associative classification
- Cluster analysis: frequent pattern-based clustering
- Data warehousing: iceberg cube and cube-gradient
- Semantic data compression: fascicles
- Broad applications

Basic Concepts: Frequent Patterns and Association Rules


Itemset X = {x1, …, xk}.
Find all the rules X → Y with minimum support and confidence:
- support, s: probability that a transaction contains X ∪ Y
- confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id | Items bought
10             | A, B, D
20             | A, C, D
30             | A, D, E
40             | B, E, F
50             | B, C, D, E, F

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Let sup_min = 50%, conf_min = 50%.
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
- A → D (60%, 100%)
- D → A (60%, 75%)
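A minimal sketch that recomputes the support and confidence of A → D and D → A on the five transactions above.

```python
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with lhs also contains rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}), confidence({"A"}, {"D"}))   # 0.6, 1.0
print(support({"A", "D"}), confidence({"D"}, {"A"}))   # 0.6, 0.75
```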

Closed Patterns and Max-Patterns


A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
Solution: mine closed patterns and max-patterns instead.
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99).
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98).
A closed pattern is a lossless compression of frequent patterns; it reduces the number of patterns and rules.

Closed Patterns and Max-Patterns


Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1.
- What is the set of closed itemsets? <a1, …, a100>: 1 and <a1, …, a50>: 2
- What is the set of max-patterns? <a1, …, a100>: 1
- What is the set of all patterns? All 2^100 − 1 non-empty subsets of {a1, …, a100} — far too many!

Scalable Methods for Mining Frequent Patterns


The downward closure property of frequent patterns:
- Any subset of a frequent itemset must be frequent
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods, three major approaches:
- Apriori (Agrawal & Srikant @VLDB'94)
- Frequent pattern growth (FPgrowth — Han, Pei & Yin @SIGMOD'00)
- Vertical data format approach (Charm — Zaki & Hsiao @SDM'02)

Apriori: A Candidate Generation-and-Test Approach


Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @VLDB'94, Mannila, et al. @ KDD'94)
Method:
- Initially, scan the DB once to get the frequent 1-itemsets
- Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
- Test the candidates against the DB
- Terminate when no frequent or candidate set can be generated

The Apriori Algorithm—An Example


sup_min = 2

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan → L3: {B,C,E}:2

The Apriori Algorithm


Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
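A compact runnable sketch of the same loop in Python (candidate generation by self-join, support counting, pruning by min_support); it follows the pseudo-code rather than any optimized implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}
    freq = {}
    while candidates:
        # count supports of the current candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        k_sets = {c for c, n in counts.items() if n >= min_support}
        freq.update({c: counts[c] for c in k_sets})
        # self-join Lk to build Ck+1, keeping only candidates whose subsets are all frequent
        candidates = {a | b for a in k_sets for b in k_sets if len(a | b) == len(a) + 1}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in k_sets for s in combinations(c, len(c) - 1))
        }
    return freq

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_support=2))   # includes frozenset({'B', 'C', 'E'}): 2
```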

Important Details of Apriori


How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning
How to count supports of candidates?
Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining L3*L3: abcd from abc and abd; acde from acd and ace
- Pruning: acde is removed because ade is not in L3
- C4 = {abcd}

How to Generate Candidates?


Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
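The same two steps as standalone Python functions, assuming itemsets are kept as sorted tuples.

```python
from itertools import combinations

def self_join(L_prev):
    """Join two (k-1)-itemsets that agree on their first k-2 items."""
    C = set()
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                C.add(p + (q[-1],))
    return C

def prune(C, L_prev):
    """Drop candidates that have a (k-1)-subset which is not frequent."""
    return {c for c in C if all(s in L_prev for s in combinations(c, len(c) - 1))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
C4 = prune(self_join(L3), L3)
print(C4)   # {('a', 'b', 'c', 'd')} — acde is pruned because ade is not in L3
```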

How to Count Supports of Candidates?


Why is counting supports of candidates a problem?
- The total number of candidates can be very huge
- One transaction may contain many candidates
Method:
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and counts
- An interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction

Example: Counting Supports of Candidates


[Figure: hash-tree of candidate 3-itemsets with hash branches 1,4,7 / 2,5,8 / 3,6,9; leaves hold itemsets such as 1 4 5, 1 2 4, 4 5 7, 1 2 5, 4 5 8, 1 5 9, 1 3 6, 2 3 4, 5 6 7, 3 4 5, 3 5 6, 3 5 7, 6 8 9, 3 6 7, 3 6 8. The subset function walks transaction 1 2 3 5 6 through the tree as 1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6, etc.]

Efficient Implementation of Apriori in SQL


It is hard to get good performance out of pure SQL (SQL-92) based approaches alone.
Make use of object-relational extensions like UDFs, BLOBs, table functions, etc. to get orders of magnitude improvement.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98.

Challenges of Frequent Pattern Mining


Challenges:
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
Improving Apriori, general ideas:
- Reduce passes of transaction database scans
- Shrink the number of candidates
- Facilitate support counting of candidates

Partition: Scan Database Only Twice


Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB'95.

DHP: Reduce the Number of Candidates


A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Candidates: a, b, c, d, e
- Hash entries: {ab, ad, ae}, {bd, be, de}, …
- Frequent 1-itemsets: a, b, d, e
- ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95.

Sampling for Frequent Patterns


Select a sample of the original database and mine frequent patterns within the sample using Apriori.
Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked.
- Example: check abcd instead of ab, ac, …, etc.
Scan the database again to find missed frequent patterns.
H. Toivonen. Sampling large databases for association rules. In VLDB'96.

DIC: Reduce Number of Scans


[Figure: itemset lattice from {} through A, B, C, D, the 2- and 3-itemsets, up to ABCD; and a timeline comparing when Apriori vs. DIC start counting 1-, 2-, and 3-itemsets during the transaction scans]
Once both A and D are determined frequent, the counting of AD begins.
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.

Bottleneck of Frequent-Pattern Mining


Multiple database scans are costly.
Mining long patterns needs many passes of scanning and generates lots of candidates:
- To find the frequent itemset i1 i2 … i100
- Number of scans: 100
- Number of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30!
Bottleneck: candidate generation and test.
Can we avoid candidate generation?

Mining Frequent Patterns Without Candidate Generation


Grow long patterns from short ones using local frequent items:
- "abc" is a frequent pattern
- Get all transactions having "abc": DB|abc
- "d" is a local frequent item in DB|abc → abcd is a frequent pattern

Construct FP-tree from a Transaction Database

min_support = 3

TID | Items bought           | (Ordered) frequent items
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o    | f, c, a, b, m
300 | b, f, h, j, o, w       | f, b
400 | b, c, k, s, p          | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p

1. Scan the DB once, find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree

Header table (item : frequency, with head of node-links): f:4, c:4, a:3, b:3, m:3, p:3
[Figure: the resulting FP-tree rooted at {}: main path f:4 → c:3 → a:3 with branches m:2 → p:2 and b:1 → m:1, a branch b:1 directly under f:4, and a separate path c:1 → b:1 → p:1]
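A minimal sketch of steps 1 and 2 (finding frequent items and reordering each transaction by the f-list) — the part of FP-tree construction that happens before any tree nodes are built.

```python
from collections import Counter

def order_by_flist(transactions, min_support):
    """Return the f-list and each transaction restricted to frequent items, in f-list order."""
    counts = Counter(item for t in transactions for item in t)     # step 1: one DB scan
    flist = [i for i, n in counts.most_common() if n >= min_support]
    rank = {item: r for r, item in enumerate(flist)}               # step 2: frequency-descending order
    ordered = [sorted((i for i in t if i in rank), key=rank.get) for t in transactions]
    return flist, ordered

tdb = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
flist, ordered = order_by_flist(tdb, min_support=3)
print(flist)      # frequent items in descending frequency; tie order may differ from f-c-a-b-m-p
print(ordered[0]) # ['f', 'c', 'a', 'm', 'p']
```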

Benefits of the FP-tree Structure


 

Completeness:
- Preserves complete information for frequent pattern mining
- Never breaks a long pattern of any transaction
Compactness:
- Reduces irrelevant info—infrequent items are gone
- Items are in frequency-descending order: the more frequently occurring, the more likely to be shared
- Never larger than the original database (not counting node-links and the count field)
- For the Connect-4 DB, the compression ratio could be over 100

Partition Patterns and Databases


 

Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
- Patterns containing p
- Patterns having m but no p
- …
- Patterns having c but no a, b, m, or p
- Pattern f
This partitioning is complete and non-redundant.

Find Patterns Having P From P-conditional Database


Starting at the frequent-item header table in the FP-tree:
- Traverse the FP-tree by following the link of each frequent item p
- Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:
item | conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1

[Figure: the FP-tree and header table from the previous slide, with node-links followed for each item]

From Conditional Pattern Bases to Conditional FP-trees


For each pattern base:
- Accumulate the count for each item in the base
- Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3 (b is infrequent in the base and is dropped)
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

[Figure: the global FP-tree with its header table on the left and the m-conditional FP-tree on the right]

Recursion: Mining Each Conditional FP-tree


Starting from the m-conditional FP-tree ({} → f:3 → c:3 → a:3):
- Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
- Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
- Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3

A Special Case: Single Prefix Path in FP-tree


Suppose a (conditional) FP-tree T has a shared single prefix path P.
Mining can be decomposed into two parts:
- Reduction of the single prefix path into one node
- Concatenation of the mining results of the two parts

[Figure: a tree whose prefix path a1:n1 → a2:n2 → a3:n3 branches into subtrees b1:m1, C1:k1, C2:k2, C3:k3 is split into the single-node prefix r1 = a1:n1 → a2:n2 → a3:n3 plus the branching part rooted at r1]

Mining Frequent Patterns With FP-trees


Idea: frequent pattern growth
- Recursively grow frequent patterns by pattern and database partition
Method:
- For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Until the resulting FP-tree is empty, or it contains only one path—a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern

Scaling FP-growth by DB Projection


What if the FP-tree cannot fit in memory?—DB projection
- First partition the database into a set of projected DBs
- Then construct and mine an FP-tree for each projected DB
Parallel projection vs. partition projection techniques:
- Parallel projection is space costly

Partition-based Projection


Parallel projection needs a lot of disk space; partition projection saves it.

[Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} is split into projected databases: p-proj DB (fcam, cb, fcam), m-proj DB (fcab, fca, fca), b-proj DB (f, cb, …), a-proj DB (fc, …), c-proj DB (f, …), f-proj DB (…), and recursively am-proj DB (fc, fc, fc), cm-proj DB (f, f, f), …]

FP-Growth vs. Apriori: Scalability With the Support Threshold


[Figure: run time (seconds) vs. support threshold (%) on data set T25I20D10K: D1 FP-growth runtime stays low while D1 Apriori runtime grows sharply as the support threshold decreases]

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold


[Figure: run time (seconds) vs. support threshold (%) on data set T25I20D100K: D2 FP-growth scales better than D2 TreeProjection as the support threshold decreases]

Why Is FP-Growth the Winner?


Divide-and-conquer:
- Decompose both the mining task and the DB according to the frequent patterns obtained so far
- Leads to focused search of smaller databases
Other factors:
- No candidate generation, no candidate test
- Compressed database: FP-tree structure
- No repeated scan of the entire database
- Basic operations are counting local frequent items and building sub FP-trees; no pattern search and matching

Implications of the Methodology


- Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00)
- Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)

MaxMiner: Mining Max-Patterns


Tid | Items
10  | A, B, C, D, E
20  | B, C, D, E
30  | A, C, D, F

1st scan: find the frequent items A, B, C, D, E.
2nd scan: find supports for the potential max-patterns AB, AC, AD, AE, ABCDE, BC, BD, BE, BCDE, CD, CE, CDE, DE.
Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan.
R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98.

Mining Frequent Closed Patterns: CLOSET


Flist: list of all frequent items in support-ascending order; here Flist = d-a-f-e-c (min_sup = 2).

TID | Items
10  | a, c, d, e, f
20  | a, b, e
30  | c, e, f
40  | a, c, d, f
50  | c, e, f

Divide the search space:
- Patterns having d
- Patterns having d but no a, etc.
Find frequent closed patterns recursively:
- Every transaction having d also has cfa → cfad is a frequent closed pattern
J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.

CLOSET+: Mining Closed Itemsets by Pattern-Growth


- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), X and all of X's descendants in the set-enumeration tree can be pruned
- Hybrid tree projection: bottom-up physical tree-projection, top-down pseudo tree-projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, it can be pruned from the header tables at higher levels
- Efficient subset checking

CHARM: Mining by Exploring Vertical Data Format


Vertical format: t(AB) = {T11, T25, …}
- tid-list: the list of transaction ids containing an itemset
Deriving closed patterns based on vertical intersections:
- t(X) = t(Y): X and Y always happen together
- t(X) ⊂ t(Y): a transaction having X always has Y
Using diffsets to accelerate mining:
- Only keep track of differences of tids
- t(X) = {T1, T2, T3}, t(XY) = {T1, T3} → Diffset(XY, X) = {T2}
Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al. @SIGMOD'00), CHARM (Zaki & Hsiao @SDM'02)

Further Improvements of Mining Methods


AFOPT (Liu, et al. @ KDD'03):
- A "push-right" method for mining condensed frequent pattern (CFP) trees
Carpenter (Pan, et al. @ KDD'03):
- Mines data sets with few rows but numerous columns
- Constructs a row-enumeration tree for efficient mining

Visualization of Association Rules: Plane Graph


Visualization of Association Rules: Rule Graph


Visualization of Association Rules (SGI/MineSet 3.0)


Mining Various Kinds of Association Rules


Mining multilevel association

Mining multidimensional association

Mining quantitative association

Mining interesting correlation patterns

Mining Multiple-Level Association Rules


Items often form hierarchies.
Flexible support settings:
- Items at the lower level are expected to have lower support
Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95).

Uniform support: the same minimum support at all levels, e.g., Level 1 min_sup = 5% and Level 2 min_sup = 5%.
Reduced support: lower minimum support at lower levels, e.g., Level 1 min_sup = 5% and Level 2 min_sup = 3%.
[Figure: Milk (support = 10%) with children 2% Milk (support = 6%) and Skim Milk (support = 4%) under the two settings]

Multi-level Association: Redundancy Filtering


Some rules may be redundant due to "ancestor" relationships between items.
Example:
- milk ⇒ wheat bread [support = 8%, confidence = 70%]
- 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.

Mining Multi-Dimensional Association


Single-dimensional rules:
- buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules: ≥ 2 dimensions or predicates
- Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
- Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical attributes: finite number of possible values, no ordering among values—data cube approach.
Quantitative attributes: numeric, implicit ordering among values—discretization, clustering, and gradient approaches.

Mining Quantitative Associations


Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
3. Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97); one-dimensional clustering, then association
4. Deviation analysis (such as Aumann and Lindell @KDD'99), e.g., Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9)

Static Discretization of Quantitative Attributes


Discretized prior to mining using a concept hierarchy; numeric values are replaced by ranges.
In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans.
A data cube is well suited for mining:
- The cells of an n-dimensional cuboid correspond to the predicate sets
- Mining from data cubes can be much faster
[Figure: cuboid lattice over (age), (income), (buys): (), the 1-D cuboids, the 2-D cuboids (age, income), (age, buys), (income, buys), and the 3-D cuboid (age, income, buys)]

Quantitative Association Rules


Proposed by Lent, Swami and Widom, ICDE'97.
Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general rules using a 2-D grid.
Example: age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")

Mining Other Interesting Patterns


Flexible support constraints (Wang et al. @ VLDB'02):
- Some items (e.g., diamond) may occur rarely but are valuable
- Customized sup_min specification and application
Top-K closed frequent patterns (Han, et al. @ ICDM'02):
- Hard to specify sup_min, but top-k with length_min is more desirable
- Dynamically raise sup_min in FP-tree construction and mining, and select the most promising path to mine

Interestingness Measure: Correlations (Lift)


play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75% > 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.
Measure of dependent/correlated events: lift

    lift(A, B) = P(A ∪ B) / (P(A) · P(B))

           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000

    lift(B, C)  = (2000/5000) / ((3000/5000) · (3750/5000)) = 0.89
    lift(B, ¬C) = (1000/5000) / ((3000/5000) · (1250/5000)) = 1.33
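A minimal sketch recomputing the two lift values from the contingency table above.

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)) estimated from counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

total = 5000
basketball, cereal, not_cereal = 3000, 3750, 1250
print(round(lift(2000, basketball, cereal, total), 2))      # 0.89 -> negatively correlated
print(round(lift(1000, basketball, not_cereal, total), 2))  # 1.33 -> positively correlated
```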

Are Lift and χ2 Good Measures of Correlation?


"Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk.
Support and confidence are not good at representing correlations.
So many interestingness measures (Tan, Kumar, Srivastava @KDD'02):

           | Milk  | No Milk | Sum (row)
Coffee     | m, c  | ~m, c   | c
No Coffee  | m, ~c | ~m, ~c  | ~c
Sum (col.) | m     | ~m      | Σ

    lift(A, B) = P(A ∪ B) / (P(A) · P(B))
    all_conf(X) = sup(X) / max_item_sup(X)
    coh(X) = sup(X) / |universe(X)|

DB | m, c | ~m, c | m, ~c | ~m, ~c  | lift | all-conf | coh  | χ2
A1 | 1000 | 100   | 100   | 10,000  | 9.26 | 0.91     | 0.83 | 9055
A2 | 100  | 1000  | 1000  | 100,000 | 8.44 | 0.09     | 0.05 | 670
A3 | 1000 | 100   | 10000 | 100,000 | 9.18 | 0.09     | 0.09 | 8172

Which Measures Should Be Used?


- lift and χ2 are not good measures for correlations in large transactional DBs
- all-conf or coherence could be good measures (Omiecinski @TKDE'03)
- Both all-conf and coherence have the downward closure property
- Efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub)

Constraint-Based (Query-Directed) Mining


Finding all the patterns in a database autonomously?—unrealistic! The patterns could be too many but not focused.
Data mining should be an interactive process:
- The user directs what is to be mined using a data mining query language (or a graphical user interface)
Constraint-based mining:
- User flexibility: provides constraints on what to be mined
- System optimization: explores such constraints for efficient mining—constraint-based mining

Constraints in Data Mining


Knowledge type constraint: classification, association, etc.
Data constraint (using SQL-like queries): find product pairs sold together in stores in Chicago in Dec.'02.
Dimension/level constraint: in relevance to region, price, brand, customer category.
Rule (or pattern) constraint: small sales (price < $10) triggers big sales (sum > $200).
Interestingness constraint: strong rules, e.g., min_support ≥ 3%, min_confidence ≥ 60%.

Constrained Mining vs. Constraint-Based Search


Constrained mining vs. constraint-based search/reasoning:
- Both are aimed at reducing the search space
- Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
- Constraint-pushing vs. heuristic search
- It is an interesting research problem how to integrate them
Constrained mining vs. query processing in DBMS:
- Database query processing requires finding all answers
- Constrained pattern mining shares a similar philosophy to pushing selections deeply into query processing

Anti-Monotonicity in Constraint Pushing


Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets.
- sum(S.Price) ≤ v is anti-monotone
- sum(S.Price) ≥ v is not anti-monotone
Example. C: range(S.profit) ≤ 15 is anti-monotone:
- Itemset ab violates C
- So does every superset of ab

TDB (min_sup = 2)
TID | Transaction
10  | a, b, c, d, f
20  | b, c, d, f, g, h
30  | a, c, d, e, f
40  | c, e, f, g

Item | Profit
a    | 40
b    | 0
c    | -20
d    | 10
e    | -30
f    | 30
g    | 20
h    | -10

Monotonicity for Constraint Pushing


Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets.
- sum(S.Price) ≥ v is monotone
- min(S.Price) ≤ v is monotone
Example. C: range(S.profit) ≥ 15:
- Itemset ab satisfies C
- So does every superset of ab
(Same TDB (min_sup = 2) and item profit table as on the previous slide.)

Succinctness


Succinctness:
- Given A1, the set of items satisfying a succinctness constraint C, then any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
- Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database
- min(S.Price) ≤ v is succinct
- sum(S.Price) ≥ v is not succinct
Optimization: if C is succinct, C is pre-counting pushable.

The Apriori Algorithm — Example


Database D (TID | Items): 100 | 1 3 4; 200 | 2 3 5; 300 | 1 2 3 5; 400 | 2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}
Scan D → L3: {2 3 5}:2

Naïve Algorithm: Apriori + Constraint


Same database D and candidate generation as in the Apriori example above, now with the constraint Sum{S.price} < 5: the naïve approach mines all frequent itemsets first and only then filters them against the constraint.

Algorithm: Push an Anti-monotone Constraint Deep 



Same database D and candidate generation as in the Apriori example above, with the anti-monotone constraint Sum{S.price} < 5 pushed deep: candidate itemsets that already violate the constraint are pruned before their supports are counted.

Algorithm: Push a Succinct Constraint Deep


Same database D and candidate generation as in the Apriori example above, with the succinct constraint min{S.price} <= 1: the constraint determines up front which items can appear in an answer, though it is not immediately used to prune support counting.

 

Converting "Tough" Constraints


Convert tough constraints into anti-monotone or monotone constraints by properly ordering the items.
Examine C: avg(S.profit) ≥ 25:
- Order items in value-descending order: <a, f, g, d, b, h, c, e>
- If an itemset afb violates C, so does afbh and afb* (any extension of afb)
- It becomes anti-monotone!
(Same TDB (min_sup = 2) and item profit table as in the anti-monotonicity slide above.)

Strongly Convertible Constraints


avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: <a, f, g, d, b, h, c, e>:
- If an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R-1: <e, c, h, b, d, g, f, a>:
- If an itemset d satisfies constraint C, so do the itemsets df and dfa, which have d as a prefix
Thus, avg(X) ≥ 25 is strongly convertible.
(Item profits as before: a 40, b 0, c -20, d 10, e -30, f 30, g 20, h -10.)

Can Apriori Handle Convertible Constraints?


A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into the Apriori mining algorithm:
- Within the level-wise framework, no direct pruning based on the constraint can be made
- Itemset df violates constraint C: avg(X) >= 25
- Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
But it can be pushed into the frequent-pattern growth framework!
(Item values as before: a 40, b 0, c -20, d 10, e -30, f 30, g 20, h -10.)

Mining With Convertible Constraints


C: avg(X) >= 25, min_sup = 2.
List the items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>:
- C is convertible anti-monotone w.r.t. R
Scan the TDB once and remove infrequent items:
- Item h is dropped
- Itemsets a and f are good, …
Projection-based mining:
- Impose an appropriate order on item projection
- Many tough constraints can be converted into (anti-)monotone constraints

TDB (min_sup = 2), transactions already in R order:
TID | Transaction
10  | a, f, d, b, c
20  | f, g, d, b, c
30  | a, f, d, c, e
40  | f, g, h, c, e

Item | Value
a    | 40
f    | 30
g    | 20
d    | 10
b    | 0
h    | -10
c    | -20
e    | -30

Handling Multiple Constraints


Different constraints may require different or even conflicting item orderings.
If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints.
If there is a conflict in the order of items:
- Try to satisfy one constraint first
- Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database

What Constraints Are Convertible?


Constraint                                      | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) ≤ v, ≥ v                                 | Yes                       | Yes                  | Yes
median(S) ≤ v, ≥ v                              | Yes                       | Yes                  | Yes
sum(S) ≤ v (items could be of any value, v ≥ 0) | Yes                       | No                   | No
sum(S) ≤ v (items could be of any value, v ≤ 0) | No                        | Yes                  | No
sum(S) ≥ v (items could be of any value, v ≥ 0) | No                        | Yes                  | No
sum(S) ≥ v (items could be of any value, v ≤ 0) | Yes                       | No                   | No
…                                               |                           |                      |

Constraint-Based Mining—A General Picture


Constraint                 | Anti-monotone | Monotone    | Succinct
v ∈ S                      | no            | yes         | yes
S ⊇ V                      | no            | yes         | yes
S ⊆ V                      | yes           | no          | yes
min(S) ≤ v                 | no            | yes         | yes
min(S) ≥ v                 | yes           | no          | yes
max(S) ≤ v                 | yes           | no          | yes
max(S) ≥ v                 | no            | yes         | yes
count(S) ≤ v               | yes           | no          | weakly
count(S) ≥ v               | no            | yes         | weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes           | no          | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no            | yes         | no
range(S) ≤ v               | yes           | no          | no
range(S) ≥ v               | no            | yes         | no
avg(S) θ v, θ ∈ {=, ≤, ≥}  | convertible   | convertible | no
support(S) ≥ ξ             | yes           | no          | no
support(S) ≤ ξ             | no            | yes         | no

A Classification of Constraints


[Figure: Venn-style classification of constraints: anti-monotone, monotone, and succinct constraints overlap; convertible anti-monotone and convertible monotone constraints intersect in the strongly convertible ones; the remaining constraints are inconvertible]

Chapter 6. Classification and Prediction


- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification
- Classification by back propagation
- Support Vector Machines (SVM)
- Associative classification
- Lazy learners (or learning from your neighbors)
- Other classification methods
- Prediction
- Accuracy and error measures
- Ensemble methods
- Model selection
- Summary

Classification vs. Prediction


Classification:
- Predicts categorical class labels (discrete or nominal)
- Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
Prediction:
- Models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications: credit approval, target marketing, medical diagnosis, fraud detection.

Classification—A Two-Step Process


Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
- Estimate the accuracy of the model:
  - The known label of each test sample is compared with the classified result from the model
  - The accuracy rate is the percentage of test set samples that are correctly classified by the model
  - The test set is independent of the training set, otherwise over-fitting will occur
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Process (1): Model Construction


Training data:
NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

A classification algorithm builds the classifier (model), e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Process (2): Using the Model in Prediction


Testing data (fed to the classifier):
NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured? The classifier predicts the label for this new tuple.

Supervised vs. Unsupervised Learning


Supervised learning (classification):
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set
Unsupervised learning (clustering):
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues: Data Preparation


Data cleaning: preprocess data in order to reduce noise and handle missing values.
Relevance analysis (feature selection): remove the irrelevant or redundant attributes.
Data transformation: generalize and/or normalize data.

Issues: Evaluating Classification Methods


Accuracy:
- Classifier accuracy: predicting class labels
- Predictor accuracy: guessing values of predicted attributes
Speed:
- Time to construct the model (training time)
- Time to use the model (classification/prediction time)
Robustness: handling noise and missing values.
Scalability: efficiency for disk-resident databases.
Interpretability: understanding and insight provided by the model.
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules.

Decision Tree Induction: Training Dataset

This follows an example of Quinlan's ID3 (Playing Tennis)

  age    income  student  credit_rating  buys_computer
  <=30   high    no       fair           no
  <=30   high    no       excellent      no
  31…40  high    no       fair           yes
  >40    medium  no       fair           yes
  >40    low     yes      fair           yes
  >40    low     yes      excellent      no
  31…40  low     yes      excellent      yes
  <=30   medium  no       fair           no
  <=30   low     yes      fair           yes
  >40    medium  yes      fair           yes
  <=30   medium  yes      excellent      yes
  31…40  medium  no       excellent      yes
  31…40  high    yes      fair           yes
  >40    medium  no       excellent      no

Output: A Decision Tree for "buys_computer"

  age?
    <=30:   student?
              no  -> buys_computer = no
              yes -> buys_computer = yes
    31..40: buys_computer = yes
    >40:    credit_rating?
              fair      -> buys_computer = yes
              excellent -> buys_computer = no

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
  Tree is constructed in a top-down recursive divide-and-conquer manner
  At start, all the training examples are at the root
  Attributes are categorical (if continuous-valued, they are discretized in advance)
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
  All samples for a given node belong to the same class
  There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  There are no samples left

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|
Expected information (entropy) needed to classify a tuple in D:

  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D)

Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples)
Class N: buys_computer = "no" (5 tuples)
(training data: the 14-tuple buys_computer table shown earlier)

  Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

  age     p_i  n_i  I(p_i, n_i)
  <=30    2    3    0.971
  31…40   4    0    0
  >40     3    2    0.971

  Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

\frac{5}{14} I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

  Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
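The numbers above can be reproduced with a short script. A minimal sketch (Python; the function names and the four-row sample below are illustrative, not part of the original slides):

    import math
    from collections import Counter

    def entropy(labels):
        # Info(D): expected information needed to classify a tuple in D
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr_index):
        # Gain(A) = Info(D) - Info_A(D) for a categorical attribute A
        n = len(labels)
        partitions = {}
        for row, label in zip(rows, labels):
            partitions.setdefault(row[attr_index], []).append(label)
        info_a = sum(len(part) / n * entropy(part) for part in partitions.values())
        return entropy(labels) - info_a

    # rows hold (age, income, student, credit_rating); labels hold buys_computer
    rows = [("<=30", "high", "no", "fair"), ("<=30", "high", "no", "excellent"),
            ("31...40", "high", "no", "fair"), (">40", "medium", "no", "fair")]
    labels = ["no", "no", "yes", "yes"]
    print(info_gain(rows, labels, 0))  # on the full 14-tuple table this gives Gain(age) = 0.246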

Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute
Must determine the best split point for A
  Sort the values of A in increasing order
  Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
  The point with the minimum expected information requirement for A is selected as the split-point for A
Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point

Gain Ratio for Attribute Selection (C4.5)

Information gain measure is biased towards attributes with a large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain):

  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex.
  SplitInfo_income(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557
  gain_ratio(income) = 0.029/1.557 = 0.019

The attribute with the maximum gain ratio is selected as the splitting attribute

Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the gini index, gini(D), is defined as

  gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

Reduction in impurity:

  \Delta gini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

Gini Index (CART, IBM IntelligentMiner): Example

Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

  gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2:

  gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\,gini(D_1) + \frac{4}{14}\,gini(D_2)

but gini_{medium,high} is 0.30 and thus the best since it is the lowest
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
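As a rough illustration (not from the original slides; the helper names are made up), the Gini computations can be checked as follows:

    def gini(labels):
        # gini(D) = 1 - sum of squared class relative frequencies
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def gini_split(d1_labels, d2_labels):
        # gini_A(D) for a binary partition D1/D2 of D
        n = len(d1_labels) + len(d2_labels)
        return (len(d1_labels) / n) * gini(d1_labels) + (len(d2_labels) / n) * gini(d2_labels)

    labels = ["yes"] * 9 + ["no"] * 5
    print(round(gini(labels), 3))  # 0.459, matching gini(D) above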

Comparing Attribute Selection Measures

The three measures, in general, return good results, but
  Information gain:
    biased towards multivalued attributes
  Gain ratio:
    tends to prefer unbalanced splits in which one partition is much smaller than the others
  Gini index:
    biased to multivalued attributes
    has difficulty when # of classes is large
    tends to favor tests that result in equal-sized partitions and purity in both partitions

Other Attribute Selection Measures

CHAID: a popular decision tree algorithm, measure based on χ2 test for independence
C-SEP: performs better than info. gain and gini index in certain cases
G-statistic: has a close approximation to the χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  The best tree is the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
  CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
  Most give good results, none is significantly superior to the others

Overfitting and Tree Pruning

Overfitting: An induced tree may overfit the training data
  Too many branches, some may reflect anomalies due to noise or outliers
  Poor accuracy for unseen samples

Two approaches to avoid overfitting
  Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
    Difficult to choose an appropriate threshold
  Postpruning: Remove branches from a "fully grown" tree—get a sequence of progressively pruned trees
    Use a set of data different from the training data to decide which is the "best pruned tree"

Enhancements to Basic Decision Tree Induction

Allow for continuous-valued attributes
  Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
Handle missing attribute values
  Assign the most common value of the attribute
  Assign probability to each of the possible values
Attribute construction
  Create new attributes based on existing ones that are sparsely represented
  This reduces fragmentation, repetition, and replication

Classification in Large Databases

Classification—a classical problem extensively studied by statisticians and machine learning researchers
Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
  relatively faster learning speed (than other classification methods)
  convertible to simple and easy to understand classification rules
  can use SQL queries for accessing databases
  comparable classification accuracy with other methods

Scalable Decision Tree Induction Methods

SLIQ (EDBT'96 — Mehta et al.)
  Builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB'96 — J. Shafer et al.)
  Constructs an attribute list data structure
PUBLIC (VLDB'98 — Rastogi & Shim)
  Integrates tree splitting and tree pruning: stop growing the tree earlier
RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti)
  Builds an AVC-list (attribute, value, class label)
BOAT (PODS'99 — Gehrke, Ganti, Ramakrishnan & Loh)
  Uses bootstrapping to create several small samples

Scalability Framework for RainForest

Separates the scalability aspects from the criteria that determine the quality of the tree
Builds an AVC-list: AVC (Attribute, Value, Class_label)
AVC-set (of an attribute X)
  Projection of the training dataset onto the attribute X and class label, where counts of individual class labels are aggregated
AVC-group (of a node n)
  Set of AVC-sets of all predictor attributes at the node n

RainForest: Training Set and Its AVC-Sets

Training examples: the 14-tuple buys_computer table shown earlier

AVC-set on Age (counts of Buy_Computer = yes / no):
  Age     yes  no
  <=30    3    2
  31..40  4    0
  >40     3    2

AVC-set on income:
  income  yes  no
  high    2    2
  medium  4    2
  low     3    1

AVC-set on Student:
  student  yes  no
  yes      6    1
  no       3    4

AVC-set on credit_rating:
  credit_rating  yes  no
  fair           6    2
  excellent      3    3

Data Cube-Based Decision-Tree Induction

Integration of generalization with decision-tree induction (Kamber et al.'97)
Classification at primitive concept levels
  E.g., precise temperature, humidity, outlook, etc.
  Low-level concepts, scattered classes, bushy classification-trees
  Semantic interpretation problems
Cube-based multi-level classification
  Relevance analysis at multi-levels
  Information-gain analysis with dimension + level

BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)

Uses a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
Each subset is used to create a tree, resulting in several trees
These trees are examined and used to construct a new tree T'
It turns out that T' is very close to the tree that would be generated using the whole data set together
Adv: requires only two scans of DB, an incremental alg.

Presentation of Classification Results


Visualization of a Decision Tree in SGI/MineSet 3.0


Interactive Visual Mining by Perception-Based Classification (PBC)


Chapter 6. Classification and Prediction

What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by backpropagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary

Bayesian Classification: Why?

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: based on Bayes' Theorem
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics

Let X be a data sample ("evidence"): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
  E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  E.g., given that X will buy computer, the prob. that X is 31..40 with medium income

Bayesian Theorem

Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem:

  P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}

Informally, this can be written as: posteriori = likelihood x prior / evidence
Predicts that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes
Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost

Towards a Naïve Bayesian Classifier

Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm
Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
This can be derived from Bayes' theorem:

  P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}

Since P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized

Derivation of Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

  P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)

This greatly reduces the computation cost: only counts the class distribution
If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k for A_k, divided by |C_{i,D}| (# of tuples of C_i in D)
If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

  g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

and P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})

Naïve Bayesian Classifier: Training Dataset

Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Training data: the 14-tuple buys_computer table shown earlier

 

Example

P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no")  = 5/14 = 0.357

Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no")  = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no")  = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no")  = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no")  = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|Ci):
  P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = "no")  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
  P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no")  * P(buys_computer = "no")  = 0.007

Therefore, X belongs to class "buys_computer = yes"
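A compact sketch of the counting involved (Python; the data handling and names are illustrative assumptions, not the slides' code):

    from collections import Counter, defaultdict

    def train_naive_bayes(rows, labels):
        # estimate P(Ci) and P(xk|Ci) by counting, as in the worked example above
        class_counts = Counter(labels)
        value_counts = defaultdict(Counter)   # (class, attribute index) -> value counts
        for row, label in zip(rows, labels):
            for k, value in enumerate(row):
                value_counts[(label, k)][value] += 1
        return class_counts, value_counts

    def classify(x, class_counts, value_counts):
        # pick the class Ci maximizing P(X|Ci) * P(Ci)
        n = sum(class_counts.values())
        best, best_score = None, -1.0
        for c, count in class_counts.items():
            score = count / n                                  # P(Ci)
            for k, value in enumerate(x):
                score *= value_counts[(c, k)][value] / count   # P(xk|Ci)
            if score > best_score:
                best, best_score = c, score
        return best, best_score

    # on the 14-tuple table, classify(("<=30", "medium", "yes", "fair"), ...) yields ("yes", ~0.028)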

Avoiding the 0-Probability Problem

Naïve Bayesian prediction requires each conditional prob. to be non-zero; otherwise, the predicted prob. will be zero:

  P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)

Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
Use Laplacian correction (or Laplacian estimator)
  Adding 1 to each case:
    Prob(income = low)    = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high)   = 11/1003
  The "corrected" prob. estimates are close to their "uncorrected" counterparts
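The Laplacian correction is just an add-one count. A tiny sketch (Python; the function name is illustrative):

    def laplace_prob(count, total, num_values):
        # Laplacian correction: add 1 to each value's count
        return (count + 1) / (total + num_values)

    print(laplace_prob(0, 1000, 3))    # income = low    -> 1/1003
    print(laplace_prob(990, 1000, 3))  # income = medium -> 991/1003
    print(laplace_prob(10, 1000, 3))   # income = high   -> 11/1003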

Naïve Bayesian Classifier: Comments

Advantages
  Easy to implement
  Good results obtained in most of the cases
Disadvantages
  Assumption: class conditional independence, therefore loss of accuracy
  Practically, dependencies exist among variables
    E.g., hospitals: patients: Profile: age, family history, etc.; Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.
    Dependencies among these cannot be modeled by a Naïve Bayesian Classifier
How to deal with these dependencies? Bayesian Belief Networks

Bayesian Belief Networks

A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
  Represents dependency among the variables
  Gives a specification of the joint probability distribution

  Nodes: random variables
  Links: dependency
  Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P
  The graph has no loops or cycles

 

Example

(Figure: a Bayesian belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea)

The conditional probability table (CPT) for variable LungCancer:

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
  LC    0.8      0.5       0.7       0.1
  ~LC   0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of its parents
Derivation of the probability of a particular combination of values of X, from the CPT:

  P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))

Training Bayesian Networks

Several scenarios:
  Given both the network structure and all variables observable: learn only the CPTs
  Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
  Network structure unknown, all variables observable: search through the model space to reconstruct network topology
  Unknown structure, all hidden variables: no good algorithms known for this purpose
Ref. D. Heckerman: Bayesian networks for data mining

Using IF-THEN Rules for Classification

Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
  n_covers = # of tuples covered by R
  n_correct = # of tuples correctly classified by R
  coverage(R) = n_covers / |D|   /* D: training data set */
  accuracy(R) = n_correct / n_covers
If more than one rule is triggered, need conflict resolution
  Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
  Class-based ordering: decreasing order of prevalence or misclassification cost per class
  Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
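A small sketch of the coverage/accuracy bookkeeping (Python; tuples are assumed to be dicts with a buys_computer label, and the names are illustrative only):

    def rule_metrics(antecedent, consequent_class, data):
        # coverage(R) = n_covers / |D|;  accuracy(R) = n_correct / n_covers
        covered = [t for t in data if antecedent(t)]
        if not covered:
            return 0.0, 0.0
        correct = sum(1 for t in covered if t["buys_computer"] == consequent_class)
        return len(covered) / len(data), correct / len(covered)

    data = [{"age": "youth", "student": "yes", "buys_computer": "yes"},
            {"age": "youth", "student": "no", "buys_computer": "no"}]
    r = lambda t: t["age"] == "youth" and t["student"] == "yes"
    print(rule_metrics(r, "yes", data))  # (0.5, 1.0)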

Rule Extraction from a Decision Tree

Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
Rules are mutually exclusive and exhaustive

Example: rule extraction from our buys_computer decision tree (root age?, with branches <=30 to student?, 31..40 to yes, and >40 to credit_rating?):
  IF age = young AND student = no             THEN buys_computer = no
  IF age = young AND student = yes            THEN buys_computer = yes
  IF age = mid-age                            THEN buys_computer = yes
  IF age = old AND credit_rating = excellent  THEN buys_computer = no
  IF age = old AND credit_rating = fair       THEN buys_computer = yes

Rule Extraction from the Training Data

Sequential covering algorithm: extracts rules directly from training data
Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
Steps:
  Rules are learned one at a time
  Each time a rule is learned, the tuples covered by the rule are removed
  The process repeats on the remaining tuples unless a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
Comp. w. decision-tree induction: learning a set of rules simultaneously

How to Learn One Rule?

Start with the most general rule possible: condition = empty
Add new attribute tests by adopting a greedy depth-first strategy
  Pick the one that most improves the rule quality
Rule-quality measures: consider both coverage and accuracy
  Foil-gain (in FOIL and RIPPER): assesses info_gain by extending the condition:

    FOIL\_Gain = pos' \times \left(\log_2\frac{pos'}{pos' + neg'} - \log_2\frac{pos}{pos + neg}\right)

  It favors rules that have high accuracy and cover many positive tuples
Rule pruning based on an independent set of test tuples:

    FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}

where pos/neg are the # of positive/negative tuples covered by R.
If FOIL_Prune is higher for the pruned version of R, prune R

Classification: A Mathematical Mapping

Classification: predicts categorical class labels
E.g., personal homepage classification
  x_i = (x1, x2, x3, …), y_i = +1 or -1
  x1: # of the word "homepage"
  x2: # of the word "welcome"
Mathematically,
  x ∈ X = ℜ^n, y ∈ Y = {+1, -1}
  We want a function f: X → Y

Linear Classification

Binary classification problem
(Figure: points of class 'x' and class 'o' separated by a red line)
  The data above the red line belongs to class 'x'
  The data below the red line belongs to class 'o'
Examples: SVM, Perceptron, Probabilistic Classifiers

Discriminative Classifiers

Advantages
  prediction accuracy is generally high
    as compared to Bayesian methods in general
  robust, works when training examples contain errors
  fast evaluation of the learned target function
    Bayesian networks are normally slow
Criticism
  long training time
  difficult to understand the learned function (weights)
    Bayesian networks can be used easily for pattern discovery
  not easy to incorporate domain knowledge
    easy in the form of priors on the data or distributions

Perceptron & Winnow

Notation: vectors x, w; scalars x, y, w
Input: {(x1, y1), …}
Output: classification function f(x)
  f(xi) > 0 for yi = +1
  f(xi) < 0 for yi = -1
Decision boundary: f(x) => w·x + b = 0, or w1 x1 + w2 x2 + b = 0
Perceptron: update w additively
Winnow: update w multiplicatively
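A minimal sketch of the additive perceptron update (Python; the training loop and names are illustrative, and Winnow would replace the additive update with a multiplicative one):

    def train_perceptron(samples, lr=1.0, epochs=10):
        # learn f(x) = w.x + b with labels y in {+1, -1}; update w additively on mistakes
        dim = len(samples[0][0])
        w, b = [0.0] * dim, 0.0
        for _ in range(epochs):
            for x, y in samples:
                if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:   # misclassified
                    w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                    b += lr * y
        return w, b

    # toy linearly separable data: class +1 above the line x1 + x2 = 3, class -1 below
    print(train_perceptron([((2, 2), +1), ((3, 1), +1), ((0, 1), -1), ((1, 0), -1)]))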

Classification by Backpropagation

Backpropagation: a neural network learning algorithm
Started by psychologists and neurobiologists to develop and test computational analogues of neurons
A neural network: a set of connected input/output units where each connection has a weight associated with it
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
Also referred to as connectionist learning due to the connections between units

Neural Network as a Classifier

Weakness
  Long training time
  Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure"
  Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
Strength
  High tolerance to noisy data
  Ability to classify untrained patterns
  Well-suited for continuous-valued inputs and outputs
  Successful on a wide array of real-world data
  Algorithms are inherently parallel
  Techniques have recently been developed for the extraction of rules from trained neural networks

A Neuron (= a perceptron)

The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
(Figure: input vector x = (x_0, x_1, …, x_n), weight vector w = (w_0, w_1, …, w_n), bias μ_k, weighted sum, activation function f, output y)

For example:

  y = sign\left(\sum_{i=0}^{n} w_i x_i + \mu_k\right)

A Multi-Layer Feed-Forward Neural Network

(Figure: input vector X feeds an input layer, a hidden layer with weights w_ij, and an output layer producing the output vector)

Net input and output of a unit j:

  I_j = \sum_i w_{ij} O_i + \theta_j,  \qquad  O_j = \frac{1}{1 + e^{-I_j}}

Error of an output-layer unit j (with target value T_j):

  Err_j = O_j (1 - O_j)(T_j - O_j)

Error of a hidden-layer unit j:

  Err_j = O_j (1 - O_j) \sum_k Err_k \, w_{jk}

Weight and bias updates (with learning rate l):

  w_{ij} = w_{ij} + (l)\, Err_j \, O_i,  \qquad  \theta_j = \theta_j + (l)\, Err_j

How A Multi-Layer Neural Network Works?

The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function


Backpropagation

Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
Modifications are made in the "backwards" direction: from the output layer, through each hidden layer down to the first hidden layer, hence "backpropagation"
Steps:
  Initialize weights (to small random #s) and biases in the network
  Propagate the inputs forward (by applying the activation function)
  Backpropagate the error (by updating weights and biases)
  Terminating condition (when error is very small, etc.)
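A rough sketch of one training epoch for a single-hidden-layer network, following the update equations shown earlier (Python; the network shape and names are assumptions made for illustration, not the slides' algorithm listing):

    import math

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    def backprop_epoch(data, w_hidden, b_hidden, w_out, b_out, lr=0.5):
        # data: list of (input tuple, target in [0, 1]); one hidden layer, one output unit
        for x, target in data:
            # propagate the inputs forward
            hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                      for ws, b in zip(w_hidden, b_hidden)]
            out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
            # backpropagate the error: Err_j = O_j(1 - O_j)(T_j - O_j) at the output unit
            err_out = out * (1 - out) * (target - out)
            err_hidden = [h * (1 - h) * err_out * w for h, w in zip(hidden, w_out)]
            # update weights and biases: w_ij += l * Err_j * O_i, theta_j += l * Err_j
            w_out = [w + lr * err_out * h for w, h in zip(w_out, hidden)]
            b_out += lr * err_out
            w_hidden = [[w + lr * e * xi for w, xi in zip(ws, x)]
                        for ws, e in zip(w_hidden, err_hidden)]
            b_hidden = [b + lr * e for b, e in zip(b_hidden, err_hidden)]
        return w_hidden, b_hidden, w_out, b_out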

Backpropagation and Interpretability

Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| * w), with |D| tuples and w weights, but the # of epochs can be exponential in n, the number of inputs, in the worst case
Rule extraction from networks: network pruning
  Simplify the network structure by removing weighted links that have the least effect on the trained network
  Then perform link, unit, or activation value clustering
  The set of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers
Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented in rules

Associative Classification

Associative classification
  Association rules are generated and analyzed for use in classification
  Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
  Classification: based on evaluating a set of rules in the form of
    p1 ^ p2 … ^ pl -> "Aclass = C" (conf, sup)
Why effective?
  It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
  In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5

Typical Associative Classification Methods

CBA (Classification By Association: Liu, Hsu & Ma, KDD'98)
  Mine possible association rules in the form of
    Cond-set (a set of attribute-value pairs) -> class label
  Build classifier: organize rules according to decreasing precedence based on confidence and then support
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
  Classification: statistical analysis on multiple rules
CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
  Generation of predictive rules (FOIL-like analysis)
  High efficiency, accuracy similar to CMAR
RCBT (Mining top-k covering rule groups for gene expression data, Cong et al. SIGMOD'05)
  Explores high-dimensional classification, using top-k rule groups
  Achieves high classification accuracy and high run-time efficiency

A Closer Look at CMAR

CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
Efficiency: uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
  Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
  Prunes rules for which the rule antecedent and class are not positively correlated, based on a χ2 test of statistical significance
Classification based on generated/pruned rules
  If only one rule satisfies tuple X, assign the class label of the rule
  If a rule set S satisfies X, CMAR
    divides S into groups according to class labels
    uses a weighted χ2 measure to find the strongest group of rules, based on the statistical correlation of rules within a group
    assigns X the class label of the strongest group

 

High Accuracy and Efficiency (Cong et al. SIGMOD05)


The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
The target function could be discrete- or real-valued
For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to x_q
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples (figure omitted)

Discussion on the k-NN Algorithm

k-NN for real-valued prediction for a given unknown tuple
  Returns the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
  Weight the contribution of each of the k neighbors according to their distance to the query x_q
  Give greater weight to closer neighbors:

    w \equiv \frac{1}{d(x_q, x_i)^2}

Robust to noisy data by averaging k nearest neighbors
Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes
  To overcome it, stretch axes or eliminate the least relevant attributes
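A short sketch of discrete-valued k-NN with Euclidean distance (Python; the toy points are made up for illustration):

    import math
    from collections import Counter

    def knn_predict(query, data, k=3):
        # majority vote among the k nearest neighbors under Euclidean distance
        neighbors = sorted(data, key=lambda item: math.dist(query, item[0]))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    data = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((5.0, 5.0), "-"), ((5.5, 4.5), "-")]
    print(knn_predict((1.1, 1.1), data, k=3))  # "+"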


Genetic Algorithms (GA)

Genetic Algorithm: based on an analogy to biological evolution
An initial population is created consisting of randomly generated rules
  Each rule is represented by a string of bits
  E.g., IF A1 AND ¬A2 THEN C2 can be encoded as 100
  If an attribute has k > 2 values, k bits can be used
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
  The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
The process continues until a population P evolves in which each rule in P satisfies a prespecified threshold
Slow but easily parallelizable


Fuzzy Set Approaches

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph)
Attribute values are converted to fuzzy values
  e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed, and these sums are combined

What Is Prediction?

(Numerical) prediction is similar to classification
  construct a model
  use the model to predict a continuous or ordered value for a given input
Prediction is different from classification
  Classification refers to predicting a categorical class label
  Prediction models continuous-valued functions
Major method for prediction: regression
  model the relationship between one or more independent or predictor variables and a dependent or response variable
Regression analysis
  Linear and multiple regression
  Non-linear regression
  Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression

Linear regression: involves a response variable y and a single predictor variable x
  y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line

  w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2},  \qquad  w_0 = \bar{y} - w_1 \bar{x}

Multiple linear regression: involves more than one predictor variable
  Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
  Ex. for 2-D data, we may have: y = w0 + w1 x1 + w2 x2
  Solvable by extension of the least squares method or using SAS, S-Plus
  Many nonlinear functions can be transformed into the above
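The least-squares estimates of w0 and w1 can be computed directly from the formulas above. A small sketch (Python; the example numbers are made up):

    def least_squares(xs, ys):
        # estimate w0, w1 for y = w0 + w1*x by the method of least squares
        n = len(xs)
        x_bar, y_bar = sum(xs) / n, sum(ys) / n
        w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
             sum((x - x_bar) ** 2 for x in xs)
        w0 = y_bar - w1 * x_bar
        return w0, w1

    # e.g. years of experience vs. salary in $1000s (made-up numbers)
    print(least_squares([3, 8, 9, 13, 3], [30, 57, 64, 72, 36]))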

Nonlinear Regression

Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a linear regression model. For example,
  y = w0 + w1 x + w2 x^2 + w3 x^3
is convertible to linear with new variables x2 = x^2, x3 = x^3:
  y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as the power function, can also be transformed to a linear model
Some models are intractably nonlinear (e.g., sum of exponential terms)
  possible to obtain least square estimates through extensive calculation on more complex formulae

Other Regression-Based Models

Generalized linear model: foundation on which linear regression can be applied to modeling categorical response variables
  Variance of y is a function of the mean value of y, not a constant
  Logistic regression: models the prob. of some event occurring as a linear function of a set of predictor variables
  Poisson regression: models data that exhibit a Poisson distribution
Log-linear models (for categorical data)
  Approximate discrete multidimensional prob. distributions
  Also useful for data compression and smoothing
Regression trees and model trees
  Trees to predict continuous values rather than class labels

Regression Trees and Model Trees

Regression tree: proposed in the CART system (Breiman et al. 1984)
  CART: Classification And Regression Trees
  Each leaf stores a continuous-valued prediction
  It is the average value of the predicted attribute for the training tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
  Each leaf holds a regression model—a multivariate linear equation for the predicted attribute
  A more general case than the regression tree
Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model

Predictive Modeling in Multidimensional Databases

Predictive modeling: predict data values or construct generalized linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
  Minimal generalization
  Attribute relevance analysis
  Generalized linear model construction
  Prediction
Determine the major factors which influence the prediction
  Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis

Prediction: Numerical Data


Prediction: Categorical Data


Classifier Accuracy Measures

                    C1              C2
  actual C1    true positive   false negative
  actual C2    false positive  true negative

  classes              buy_computer = yes  buy_computer = no  total   recognition (%)
  buy_computer = yes   6954                46                 7000    99.34
  buy_computer = no    412                 2588               3000    86.27
  total                7366                2634               10000   95.42

Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M
  Error rate (misclassification rate) of M = 1 - acc(M)
  Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the # of tuples in class i that are labeled by the classifier as class j
Alternative accuracy measures (e.g., for cancer diagnosis):
  sensitivity = t-pos/pos              /* true positive recognition rate */
  specificity = t-neg/neg              /* true negative recognition rate */
  precision   = t-pos/(t-pos + f-pos)
  accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
This model can also be used for cost-benefit analysis
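The alternative measures can be read straight off the confusion matrix. A small sketch using the buy_computer counts above (Python; the function name is illustrative):

    def accuracy_measures(t_pos, f_neg, f_pos, t_neg):
        # sensitivity, specificity, precision and accuracy from a 2x2 confusion matrix
        pos, neg = t_pos + f_neg, f_pos + t_neg
        sensitivity = t_pos / pos
        specificity = t_neg / neg
        precision = t_pos / (t_pos + f_pos)
        accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
        return sensitivity, specificity, precision, accuracy

    # confusion matrix from the buy_computer example above
    print(accuracy_measures(6954, 46, 412, 2588))  # accuracy ~ 0.9542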

UNIT IV - Cluster Analysis

1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Outlier Analysis

What is Cluster Analysis?

Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters
Cluster analysis
  Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms

Clustering: Rich Applications and Multidisciplinary Efforts

Pattern recognition
Spatial data analysis
  Create thematic maps in GIS by clustering feature spaces
  Detect spatial clusters or for other spatial mining tasks
Image processing
Economic science (especially market research)
WWW
  Document classification
  Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: identification of areas of similar land use in an earth observation database
Insurance: identifying groups of motor insurance policy holders with a high average claim cost
City-planning: identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Quality: What Is Good Clustering?

A good clustering method will produce high quality clusters with
  high intra-class similarity
  low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
There is a separate "quality" function that measures the "goodness" of a cluster
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
Weights should be associated with different variables based on applications and data semantics
It is hard to define "similar enough" or "good enough"
  the answer is typically highly subjective

Requirements of Clustering in Data Mining

Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability

Data Structures

Data matrix (two modes):

  [ x_11  ...  x_1f  ...  x_1p ]
  [  ...  ...   ...  ...   ... ]
  [ x_i1  ...  x_if  ...  x_ip ]
  [  ...  ...   ...  ...   ... ]
  [ x_n1  ...  x_nf  ...  x_np ]

Dissimilarity matrix (one mode):

  [   0                          ]
  [ d(2,1)    0                  ]
  [ d(3,1)  d(3,2)    0          ]
  [   :        :      :          ]
  [ d(n,1)  d(n,2)   ...    0    ]

Type of data in clustering analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types

Interval-valued variables

Standardize data
  Calculate the mean absolute deviation:

    s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)

  where

    m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)

  Calculate the standardized measurement (z-score):

    z_{if} = \frac{x_{if} - m_f}{s_f}

Using the mean absolute deviation is more robust than using the standard deviation
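A small sketch of the z-score with mean absolute deviation (Python; the sample values are made up):

    def standardize(values):
        # z-score using the mean absolute deviation, as defined above
        n = len(values)
        m = sum(values) / n
        s = sum(abs(x - m) for x in values) / n  # mean absolute deviation
        return [(x - m) / s for x in values]

    print(standardize([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]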

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include the Minkowski distance:

  d(i,j) = \left(|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q\right)^{1/q}

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and q is a positive integer
If q = 1, d is the Manhattan distance:

  d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|

Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is the Euclidean distance:

  d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}

Properties
  d(i,j) ≥ 0
  d(i,i) = 0
  d(i,j) = d(j,i)
  d(i,j) ≤ d(i,k) + d(k,j)
Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
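A one-function sketch of the Minkowski distance (Python); q = 1 and q = 2 recover the Manhattan and Euclidean cases:

    def minkowski(x, y, q=2):
        # Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean
        return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

    print(minkowski((1, 2), (4, 6), q=1))  # 7.0 (Manhattan)
    print(minkowski((1, 2), (4, 6), q=2))  # 5.0 (Euclidean)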

Binary Variables

A contingency table for binary data:

                object j
                  1      0     sum
  object i   1    a      b     a+b
             0    c      d     c+d
           sum   a+c    b+d     p

Distance measure for symmetric binary variables:

  d(i,j) = \frac{b+c}{a+b+c+d}

Distance measure for asymmetric binary variables:

  d(i,j) = \frac{b+c}{a+b+c}

Jaccard coefficient (similarity measure for asymmetric binary variables):

  sim_{Jaccard}(i,j) = \frac{a}{a+b+c}

Dissimilarity between Binary Variables

Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

gender is a symmetric attribute; the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0

  d(jack, mary) = (0 + 1)/(2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1)/(1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2)/(1 + 1 + 2) = 0.75

Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: simple matching
  m: # of matches, p: total # of variables

  d(i,j) = \frac{p - m}{p}

Method 2: use a large number of binary variables
  creating a new binary variable for each of the M nominal states

Ordinal Variables

An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
  replace x_{if} by its rank r_{if} ∈ {1, …, M_f}
  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    z_{if} = \frac{r_{if} - 1}{M_f - 1}

  compute the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}
Methods:
  treat them like interval-scaled variables—not a good choice! (why?—the scale can be distorted)
  apply a logarithmic transformation: y_{if} = log(x_{if})
  treat them as continuous ordinal data and treat their rank as interval-scaled

Variables of Mixed Types

A database may contain all six types of variables
  symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects:

  d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}

  f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
  f is interval-based: use the normalized distance
  f is ordinal or ratio-scaled: compute ranks r_{if} and z_{if} = \frac{r_{if} - 1}{M_f - 1}, and treat z_{if} as interval-scaled

Vector Objects

Vector objects: keywords in documents, gene features in micro-arrays, etc.
Broad applications: information retrieval, biologic taxonomy, etc.
Cosine measure
A variant: Tanimoto coefficient

Major Clustering Approaches (I)

Partitioning approach:
  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
  Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
  Create a hierarchical decomposition of the set of data (or objects) using some criterion
  Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON
Density-based approach:
  Based on connectivity and density functions
  Typical methods: DBSCAN, OPTICS, DenClue

Major Clustering Approaches (II)

Grid-based approach:
  Based on a multiple-level granularity structure
  Typical methods: STING, WaveCluster, CLIQUE
Model-based:
  A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
  Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
  Based on the analysis of frequent patterns
  Typical methods: pCluster
User-guided or constraint-based:
  Clustering by considering user-specified or application-specific constraints
  Typical methods: COD (obstacles), constrained clustering

Typical Alternatives to Calculate the Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min(t_ip, t_jq)
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max(t_ip, t_jq)
Average: avg distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg(t_ip, t_jq)
Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j)
Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = dis(M_i, M_j)
  Medoid: one chosen, centrally located object in the cluster

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

Centroid: the "middle" of a cluster

  C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}

Radius: square root of the average distance from any point of the cluster to its centroid

  R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}

Diameter: square root of the average mean squared distance between all pairs of points in the cluster

  D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}

Partitioning Algorithms: BasicConcept

Partitioning method: Construct a partition of a database D of 

n objects into a set of k clusters, s.t., min sum of squared

distance:

Σ_{m=1}^{k} Σ_{t_mi ∈ Km} (Cm - t_mi)^2


Given a k , find a partition of k clusters that optimizes the

chosen partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen’67): Each cluster is represented by the

center of the cluster

k-medoids or PAM (Partition around medoids) (Kaufman &

Rousseeuw’87): Each cluster is represented by one of the objects

in the cluster


The K-Means Clustering

Method 

Given k , the k-means algorithm is implemented

in four steps:


Partition objects into k nonempty subsets

Compute seed points as the centroids of the clusters

of the current partition (the centroid is the center, i.e.,

mean point, of the cluster)

Assign each object to the cluster with the nearest

seed point

Go back to Step 2, stop when no more new

assignment
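A minimal k-means sketch following these steps (illustrative only; the point set and k are made-up, and the initial partition is replaced by the common variant of picking k objects as initial seeds):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                       # variant of step 1: pick k objects as seeds
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # assign each object to the nearest seed point
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[idx].append(p)
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]  # recompute the cluster means
        if new_centers == centers:                        # stop when no assignment changes
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(pts, k=2)[0])
```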

The K-Means Clustering

Method 

Example

[Figure: k-means example with K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign, and repeat until no object changes cluster.]

Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is # objects, k is

# clusters, and t is # iterations. Normally, k, t << n.

Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))


Comment: Often terminates at a local optimum. The global

optimum may be found using techniques such as:

deterministic annealing and genetic algorithms

Weakness

Applicable only when mean is defined, then what about

categorical data?

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Not suitable to discover clusters with non-convex shapes

Variations of the K-Means 

Method

A few variants of the k-means which differ in

Selection of the initial k means


Dissimilarity calculations

Strategies to calculate cluster means

Handling categorical data: k-modes (Huang’98)

Replacing means of clusters with modes

Using new dissimilarity measures to deal with categorical objects

Using a frequency-based method to update modes of clusters

A mixture of categorical and numerical data: k-prototype method
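To make the k-modes idea concrete, here is a small hedged sketch (not from the slides) of the simple-matching dissimilarity and the mode update that replaces the mean; the category values are made up.

```python
from collections import Counter

def matching_dissim(x, y):
    """Simple matching dissimilarity: number of attributes on which two objects differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def mode_of(cluster):
    """Cluster mode: the most frequent category value in each attribute position."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [('red', 'small', 'round'),
           ('red', 'large', 'round'),
           ('blue', 'small', 'round')]
print(mode_of(cluster))                                        # ('red', 'small', 'round')
print(matching_dissim(cluster[0], ('blue', 'small', 'oval')))  # 2
```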

What Is the Problem of the K-Means Method?

The k-means algorithm is sensitive to outliers!

Since an object with an extremely large value may substantially


distort the distribution of the data.

K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.


The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)


starts from an initial set of medoids and iteratively replaces

one of the medoids by one of the non-medoids if it improves

the total distance of the resulting clustering

PAM works effectively for small data sets, but does not scale

well for large data sets

CLARA (Kaufmann & Rousseeuw, 1990)

CLARANS (Ng & Han, 1994): Randomized sampling

Focusing + spatial data structure (Ester et al., 1995)

A Typical K-Medoids Algorithm (PAM)

[Figure: a typical k-medoids (PAM) run with K = 2. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random and compute the total cost of swapping (the example shows total costs of 20 and 26); swap O and O_random if the quality is improved; loop until no change.]

PAM (Partitioning Around Medoids)

(1987)

PAM (Kaufman and Rousseeuw, 1987), built in Splus

Use real object to represent the cluster


Select k representative objects arbitrarily

For each pair of non-selected object h and selected object i ,

calculate the total swapping cost TCih

For each pair of i and h,

If TCih < 0, i is replaced by h

 Then assign each non-selected object to the

most similar representative object

Repeat steps 2-3 until there is no change
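A compact sketch of this loop (illustrative only; each pass costs O(k(n-k)^2), so it is meant for small data sets, and the point set below is made up):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])                 # select k representative objects arbitrarily
    best = total_cost(points, medoids)
    improved = True
    while improved:                            # repeat until no swap improves the clustering
        improved = False
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(points, candidate)
                if cost < best:                # equivalent to TC_ih < 0: replace medoid i by h
                    medoids, best, improved = candidate, cost, True
    return medoids, best

pts = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9), (25, 25)]  # last point is an outlier
print(pam(pts, k=2))
```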

PAM Clustering: Total swapping cost

TCih = Σj Cjih

[Figure: the four cases for the reassignment cost C_jih of a non-selected object j when medoid i is swapped with non-medoid h (t denotes another current medoid): C_jih = 0; C_jih = d(j, h) - d(j, i); C_jih = d(j, t) - d(j, i); C_jih = d(j, h) - d(j, t).]
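A hedged sketch of the total swapping cost: rather than enumerating the four cases explicitly, it recomputes each non-selected object's distance to its nearest representative before and after the candidate swap, which yields the same TC_ih. The points and medoids are made-up examples.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def swap_cost(points, medoids, i, h, dist=euclid):
    """TC_ih: change in total clustering cost if medoid i is replaced by non-medoid h."""
    after = [m for m in medoids if m != i] + [h]
    tc = 0.0
    for j in points:
        if j in medoids or j == h:            # C_jih is defined for the other non-selected objects
            continue
        tc += min(dist(j, m) for m in after) - min(dist(j, m) for m in medoids)
    return tc

medoids = [(1, 1), (8, 8)]
pts = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 9), (5, 5)]
print(swap_cost(pts, medoids, i=(8, 8), h=(5, 5)))   # positive here, so the swap would be rejected
```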

What Is the Problem with PAM?

PAM is more robust than k-means in the presence

of noise and outliers because a medoid is less


influenced by outliers or other extreme values

than a mean

PAM works efficiently for small data sets but does not scale well for large data sets: O(k(n-k)^2) for each iteration,

where n is # of data, k is # of clusters

Sampling-based method:

CLARA (Clustering LARge Applications)

CLARA (Clustering LARge Applications) (1990)

CLARA (Kaufmann and Rousseeuw in 1990)

Built in statistical analysis packages, such as S+


It draws multiple samples of the data set, applies PAM 

on each sample, and gives the best clustering as the

output

Strength: deals with larger data sets than PAM

Weakness: Efficiency depends on the sample size

A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample

is biased
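A hedged sketch of the CLARA idea, reusing the pam() and total_cost() helpers from the PAM sketch above: draw several samples, cluster each sample with PAM, and keep the medoids that give the lowest cost over the full data set. The sample size 40 + 2k follows a commonly cited choice; num_samples is arbitrary.

```python
import random

def clara(points, k, num_samples=5, seed=0):
    # Assumes pam() and total_cost() as defined in the PAM sketch above.
    rng = random.Random(seed)
    sample_size = min(40 + 2 * k, len(points))
    best_medoids, best_cost = None, float('inf')
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)
        medoids, _ = pam(sample, k)                # cluster the sample with PAM
        cost = total_cost(points, medoids)         # evaluate the medoids on the whole data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```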

CLARANS (“Randomized” CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized Search)

(Ng and Han’94)

CLARANS draws a sample of neighbors dynamically


 The clustering process can be presented as searching a graph

where every node is a potential solution, that is, a set of k  

medoids

If the local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum

It is more efficient and scalable than both PAM and CLARA

Focusing techniques and spatial access structures may further

improve its performance (Ester et al.’95)

What Is Outlier Discovery?

What are outliers? The set of objects that are considerably dissimilar from the

remainder of the data

Example: Sports: Michael Jordan, Wayne Gretzky, ...


Problem: Define and find outliers in large data sets

Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis

Statistical Approaches


Assume a model of the underlying distribution that generates the data set (e.g., normal distribution)

Use discordancy tests depending on

data distribution
distribution parameter (e.g., mean, variance)
number of expected outliers

Drawbacks:
most tests are for a single attribute

In many cases, the data distribution may not be known

Outlier Discovery: Distance-Based Approach

Introduced to counter the main limitations imposed by statistical methods
We need multi-dimensional analysis without knowing the data

distribution


Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O

Algorithms for mining distance-based outliers

Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
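A small sketch of the nested-loop idea for DB(p, D)-outliers: an object O is flagged when at least a fraction p of the remaining objects lie farther than D from it. The data points and thresholds below are made-up illustrations.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def db_outliers(points, p, D):
    """Return the DB(p, D)-outliers: objects for which at least a fraction p
    of the remaining objects lie at a distance greater than D."""
    outliers = []
    for o in points:
        others = [x for x in points if x is not o]
        far = sum(1 for x in others if euclid(o, x) > D)
        if far >= p * len(others):
            outliers.append(o)
    return outliers

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
print(db_outliers(pts, p=0.9, D=3.0))   # [(10, 10)]
```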

 Local Outlier

Detection

Distance-based outlier detection is based on the global

distance distribution

It encounters difficulties to


identify outliers if the data is not uniformly distributed

Ex. C1 contains 400 loosely

distributed points, C2 has 100 tightly condensed points, and 2 outlier points o1, o2

A distance-based method cannot identify o2 as an outlier

Need the concept of a local outlier

Local outlier factor (LOF)
Assume being an outlier is not a crisp property
Each point has a LOF
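For a hands-on feel of LOF, one common route in practice is scikit-learn's LocalOutlierFactor; the snippet below is an illustration (not from the slides), with an arbitrary neighborhood size and synthetic data shaped like the C1/C2 example above.

```python
# Illustrative LOF usage (assumes numpy and scikit-learn are installed).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
C1 = rng.uniform(-2, 2, size=(400, 2))          # loosely distributed cluster
C2 = rng.normal(6, 0.1, size=(100, 2))          # tightly condensed cluster
outliers = np.array([[6.0, 3.0], [0.0, 8.0]])   # o1, o2
X = np.vstack([C1, C2, outliers])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                     # -1 for outliers, 1 for inliers
scores = -lof.negative_outlier_factor_          # larger score means more outlying
print(labels[-2:], scores[-2:])
```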

Outlier Discovery: Deviation-Based

Approach

Identifies outliers by examining the main

characteristics of objects in a group

Objects that “deviate” from this description are


considered outliers

Sequential exception technique

simulates the way in which humans can distinguish