Scalable Mining For Classification Rules in Relational Databases

Presented by: Nadav Grossaug (נדב גרוסאוג)

Min Wang, Bala Iyer, Jeffrey Scott Vitter

Abstract

• Problem: increase in size of training set

• MIND (MINing in Database) Classifier

• Can be Implemented easily over SQL

• Other Classifiers Need O(N) space In Memory.

• MIND Scales Well Over :

• I/O

• # of Processors

Overview

• Introduction

• Algorithm

• Database Implementation

• Performance

• Experimental Results

• Conclusions

Introduction - Classification Problem

[Figure: the DETAIL TABLE is fed to a CLASSIFIER, which produces a decision tree. Internal nodes test Age <= 30 and salary <= 62K with yes/no branches; leaves are labeled safe or risky.]

Introduction - Scalability in Classification

Importance of scalability:

• Use a very large training set: data is not memory resident.

• Number of CPUs: better usage of resources.

Introduction - Scalability in Classification

Properties of MIND:

• Scalable in memory

• Scalable in CPU

• Uses SQL

• Easy to implement

Assumptions:

• Attribute values are discrete

• We focus on the growth stage (no pruning)

The Algorithm - Data Structure

Data in the DETAIL table:

DETAIL(attr1, attr2, …, class, leaf_num)

attri = the i-th attribute

class = class type

leaf_num = the number of the leaf the example belongs to (this can be calculated from the known tree)

The Algorithm - gini index

S - data set

C - number of classes

Pi - relative frequency of class i in S

gini index: $gini(S) = 1 - \sum_{i=1}^{C} P_i^2$
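The definitions above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard form of the index, 1 minus the sum of squared class frequencies. The helper names `gini` and `gini_split` are hypothetical, and the size-weighted form used for a binary split is the usual convention rather than something stated on the slide.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(left, right):
    """Size-weighted gini index of a binary split (assumed convention)."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return (len(left) * gini(left) + len(right) * gini(right)) / n
```

A pure set has index 0, and a two-class set split evenly has index 0.5, so a lower value marks a better split.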

The Algorithm

GrowTree(DETAIL TABLE)
    Initialize tree T and put all records of DETAIL in root
    while (some leaf in T is not a STOP node)
        for each attribute i do
            evaluate gini index for each non-STOP leaf
            at each split value with respect to attribute i
        for each non-STOP leaf do
            get the overall best split for it;
        partition the records and grow the tree for one more level according to best splits;
        mark all small or pure leaves as STOP nodes;
    return T;
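The level-wise loop can be sketched in memory as follows. This is only an illustration of the control flow, not the database implementation: the names `grow_tree`, `best_split` and `min_size`, the row layout (attribute values followed by the class label), and the choice to test the STOP condition at the start of each pass are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(rows, n_attrs):
    """Return (gini, attr_index, value) for the best 'attr <= value' split, or None."""
    best = None
    for i in range(n_attrs):
        for v in sorted({r[i] for r in rows}):
            left = [r[-1] for r in rows if r[i] <= v]
            right = [r[-1] for r in rows if r[i] > v]
            if not left or not right:
                continue  # a split must send records to both sides
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or g < best[0]:
                best = (g, i, v)
    return best

def grow_tree(detail, n_attrs, min_size=2):
    """Grow the tree one level per pass; small or pure leaves become STOP nodes."""
    tree = {}                    # node -> (attr_index, value, low_child, high_child)
    leaves = {0: list(detail)}   # non-STOP leaves: leaf_num -> rows (root is 0)
    stopped = {}                 # STOP leaves: leaf_num -> majority class
    next_id = 1
    while leaves:
        new_leaves = {}
        for leaf, rows in leaves.items():
            labels = [r[-1] for r in rows]
            split = None
            if len(rows) >= min_size and len(set(labels)) > 1:
                split = best_split(rows, n_attrs)
            if split is None:
                stopped[leaf] = Counter(labels).most_common(1)[0][0] if labels else None
                continue
            _, i, v = split
            tree[leaf] = (i, v, next_id, next_id + 1)
            new_leaves[next_id] = [r for r in rows if r[i] <= v]
            new_leaves[next_id + 1] = [r for r in rows if r[i] > v]
            next_id += 2
        leaves = new_leaves
    return tree, stopped

# Rows are (age, salary, class); the values are made up for the example.
detail = [(25, 40, "risky"), (28, 70, "safe"), (45, 50, "safe"), (50, 90, "safe")]
tree, stopped = grow_tree(detail, n_attrs=2)
```

Every non-STOP leaf is handled in the same pass, which is what lets the database version evaluate all leaves of a level with one set of queries.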

Database Implementation - Dimension table

• For each attribute and each level of the tree:

INSERT INTO DIMi
SELECT leaf_num, class, attri, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attri

Size of DIMi is at most #leaves * #distinct values of attri * #classes
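The dimension-table query can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the sample rows, the `STOP = -1` sentinel, and the column name `cnt` (used instead of `count` to stay clear of the aggregate function's name) are all made up for the example.

```python
import sqlite3

STOP = -1  # assumed sentinel: leaf_num of records that sit in STOP leaves

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DETAIL (attr1 INTEGER, attr2 INTEGER, class TEXT, leaf_num INTEGER)"
)
conn.executemany(
    "INSERT INTO DETAIL VALUES (?, ?, ?, ?)",
    [(25, 40, "risky", 0), (25, 70, "safe", 0), (45, 40, "safe", 0),
     (45, 40, "safe", 0), (30, 60, "risky", STOP)],
)

# One dimension table per attribute; here only DIM1 for attr1.
conn.execute(
    "CREATE TABLE DIM1 (leaf_num INTEGER, class TEXT, attr1 INTEGER, cnt INTEGER)"
)
conn.execute(
    """INSERT INTO DIM1
       SELECT leaf_num, class, attr1, COUNT(*)
       FROM DETAIL
       WHERE leaf_num <> ?
       GROUP BY leaf_num, class, attr1""",
    (STOP,),
)

dim1 = conn.execute(
    "SELECT leaf_num, class, attr1, cnt FROM DIM1 ORDER BY class, attr1"
).fetchall()
```

Five DETAIL rows collapse to three DIM1 rows here, which shows why the dimension tables are typically much smaller than DETAIL when attributes take few distinct values.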

Database Implementation - Dimension table SQL

SELECT FROM DETAIL
    INSERT INTO DIM1 leaf_num, class, attr1, count(*)
        WHERE leaf_num <> STOP
        GROUP BY leaf_num, class, attr1
    INSERT INTO DIM2 leaf_num, class, attr2, count(*)
        WHERE leaf_num <> STOP
        GROUP BY leaf_num, class, attr2
    • • •

Database Implementation - UP/DOWN - split

For each attribute we find all possible split points:

INSERT INTO UP
SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
FROM DIMi d1 FULL OUTER JOIN DIMi d2
    ON d1.leaf_num = d2.leaf_num AND d2.attri <= d1.attri AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attri, d1.class
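The cumulative counts in UP can be reproduced on a toy DIM1 table with `sqlite3`. Note that this sketch swaps in a different join from the one on the slide: it builds the grid of (leaf, attribute value) and (leaf, class) pairs and LEFT JOINs the counts onto it, so that a class with no records at a given value still gets a cumulative count. That choice, the sample rows and the `cnt` column name are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# DIM1 as it might look for attribute attr1; the count column is called cnt here.
conn.execute(
    "CREATE TABLE DIM1 (leaf_num INTEGER, class TEXT, attr1 INTEGER, cnt INTEGER)"
)
conn.executemany(
    "INSERT INTO DIM1 VALUES (?, ?, ?, ?)",
    [(0, "risky", 25, 1), (0, "safe", 25, 1), (0, "safe", 45, 2)],
)

# UP holds, per leaf, split value and class, the number of records
# whose attribute value is <= the split value.
conn.execute(
    "CREATE TABLE UP (leaf_num INTEGER, attr1 INTEGER, class TEXT, cnt INTEGER)"
)
conn.execute("""
    INSERT INTO UP
    SELECT v.leaf_num, v.attr1, c.class, COALESCE(SUM(d.cnt), 0)
    FROM (SELECT DISTINCT leaf_num, attr1 FROM DIM1) v
    JOIN (SELECT DISTINCT leaf_num, class FROM DIM1) c
      ON c.leaf_num = v.leaf_num
    LEFT JOIN DIM1 d
      ON d.leaf_num = v.leaf_num AND d.class = c.class AND d.attr1 <= v.attr1
    GROUP BY v.leaf_num, v.attr1, c.class
""")

up = conn.execute(
    "SELECT leaf_num, attr1, class, cnt FROM UP ORDER BY attr1, class"
).fetchall()
```

A DOWN table would presumably use the complementary condition (attribute value greater than the split value); the slide does not show that query, so it is left out here.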

Database Implementation - Class View

Create a view for each class k and attribute i:

CREATE VIEW Ck_UP(leaf_num, attri, count) AS
SELECT leaf_num, attri, count
FROM UP
WHERE class = k

Database Implementation - GINI VALUE

Create a view for all gini values:

CREATE VIEW GINI_VALUE(leaf_num, attri, gini) AS
SELECT u1.leaf_num, u1.attri, ƒgini
FROM C1_UP u1, .., Cc_UP uc, C1_DOWN d1, .., Cc_DOWN dc
WHERE u1.attri = .. = uc.attri = .. = dc.attri
AND u1.leaf_num = .. = uc.leaf_num = .. = dc.leaf_num
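The ƒgini expression combines the per-class counts on the two sides of a candidate split. A minimal Python sketch of such a function follows, assuming the size-weighted gini index of the two partitions; the name `f_gini` and its list-of-counts interface are hypothetical and only stand in for the SQL expression.

```python
def f_gini(up_counts, down_counts):
    """Weighted gini index of a split.

    up_counts[k]   = number of class-k records with attribute value <= split value
    down_counts[k] = number of class-k records with attribute value >  split value
    """
    def side(counts):
        n = sum(counts)
        if n == 0:
            return 0, 0.0
        return n, 1.0 - sum((c / n) ** 2 for c in counts)

    n_up, g_up = side(up_counts)
    n_down, g_down = side(down_counts)
    n = n_up + n_down
    if n == 0:
        return 0.0
    return (n_up * g_up + n_down * g_down) / n
```

For example, `f_gini([1, 1], [0, 2])` weighs an impure upper side (index 0.5) against a pure lower side (index 0) and gives 0.25.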

Database Implementation - MIN GINI VALUE

Create a table of minimum gini values for attribute i:

INSERT INTO MIN_GINI
SELECT leaf_num, i, attri, gini
FROM GINI_VALUE a
WHERE a.gini = (SELECT MIN(gini) FROM GINI_VALUE b WHERE a.leaf_num = b.leaf_num)

Database Implementation - BEST SPLIT

Create a view over MIN_GINI for the best split:

CREATE VIEW BEST_SPLIT(leaf_num, attr_name, attr_value) AS
SELECT leaf_num, attr_name, attr_value
FROM MIN_GINI a
WHERE a.gini = (SELECT MIN(gini) FROM MIN_GINI b WHERE a.leaf_num = b.leaf_num)
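The correlated MIN subquery can be checked on a small MIN_GINI table with `sqlite3`. The rows below, including the attribute names `age` and `salary` and the gini values, are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MIN_GINI (leaf_num INTEGER, attr_name TEXT, attr_value INTEGER, gini REAL)"
)
# One row per (leaf, attribute): the best split value of that attribute.
conn.executemany(
    "INSERT INTO MIN_GINI VALUES (?, ?, ?, ?)",
    [(0, "age", 30, 0.25), (0, "salary", 62, 0.0),
     (1, "age", 45, 0.125), (1, "salary", 80, 0.375)],
)

# For each leaf keep the attribute whose best split has the lowest gini index.
conn.execute("""
    CREATE VIEW BEST_SPLIT (leaf_num, attr_name, attr_value) AS
    SELECT leaf_num, attr_name, attr_value
    FROM MIN_GINI a
    WHERE a.gini = (SELECT MIN(gini) FROM MIN_GINI b WHERE a.leaf_num = b.leaf_num)
""")

best = conn.execute("SELECT * FROM BEST_SPLIT ORDER BY leaf_num").fetchall()
```

If two attributes tie on the minimum gini value for a leaf, this query returns both rows, so a real implementation would need a tie-breaking rule.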

Database Implementation - Partitioning

Build new nodes by splitting old nodes according to the BEST_SPLIT values

Assign each record to its correct node: leaf_num is computed by a function

No need to UPDATE data or DB
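One way to picture this is a user-defined function that derives the leaf number from the known tree at query time, so DETAIL is never rewritten. The sketch below uses `sqlite3.Connection.create_function`; the function name `leaf_of`, the `TREE` dictionary layout and the sample rows are all assumptions made for the example, not the interface used in the original system.

```python
import sqlite3

# Hypothetical known tree: node -> (column index, threshold, child if <=, child if >).
TREE = {0: (0, 30, 1, 2)}  # the root tests attr1 <= 30; nodes 1 and 2 are leaves

def leaf_of(attr1, attr2):
    """Walk the known tree and return the leaf number the record falls into."""
    values = (attr1, attr2)
    node = 0
    while node in TREE:
        col, threshold, low, high = TREE[node]
        node = low if values[col] <= threshold else high
    return node

conn = sqlite3.connect(":memory:")
conn.create_function("leaf_of", 2, leaf_of)  # expose the Python function to SQL
conn.execute("CREATE TABLE DETAIL (attr1 INTEGER, attr2 INTEGER, class TEXT)")
conn.executemany(
    "INSERT INTO DETAIL VALUES (?, ?, ?)",
    [(25, 40, "risky"), (28, 70, "safe"), (45, 50, "safe")],
)

# leaf_num is computed on the fly; no UPDATE of DETAIL is needed.
rows = conn.execute(
    "SELECT leaf_of(attr1, attr2) AS leaf_num, class, COUNT(*) "
    "FROM DETAIL GROUP BY leaf_num, class ORDER BY leaf_num, class"
).fetchall()
```

After each level only the in-memory tree changes, and the next round of counting queries picks up the new leaf numbers automatically.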

Performance

I/O cost of MIND:

I/O cost of SPRINT:

Experimental Results

[Figure: Normalized time to finish building the tree]

[Figure: Normalized time to build the tree per example]

[Figure: Normalized time to build the tree per # of processors]

[Figure: Time to build tree by training set size]

Conclusions

• MIND works over DB

• MIND works well because

– MIND rephrases the classification to a DB problem

– MIND avoids UPDATEs to the DETAIL table

– Parallelism and scaling are achieved by the use of RDBMS

– MIND uses a user function to get the performance gain in the DIMi creation.
