Scalable Mining For Classification Rules in Relational Databases
Presented by: נדב גרוסאוג
Min Wang, Bala Iyer, Jeffrey Scott Vitter
Abstract
• Problem: Increase in Size of Training Set
• MIND (MINing in Database) Classifier
• Can be Implemented easily over SQL
• Other Classifiers Need O(N) space In Memory.
• MIND Scales Well Over :
• I/O
• # of Processors
Overview
• Introduction
• Algorithm
• Database Implementation
• Performance
• Experimental Results
• Conclusions
Introduction - Classification Problem
[Figure: DETAIL TABLE and CLASSIFIER, with a decision tree that uses the tests "Age <= 30" and "salary <= 62K", yes/no branches, and leaves labeled safe, safe, and risky.]
Introduction - Scalability in Classification
Importance of Scalability:
• Use a very large training set – data is not memory resident.
• Number of CPUs – better usage of resources.
Introduction - Scalability in Classification
Properties of MIND:
• Scalable in memory
• Scalable in CPU
• Uses SQL
• Easy to implement
Assumptions:
• Attribute values are discrete
• We focus on the growth stage (no pruning)
The Algorithm - Data Structure
Data in DETAIL table:
DETAIL(attr1, attr2, …, class, leaf_num)
attri = the i-th attribute
class = class type
leaf_num = the number of the leaf the example belongs to (this can be calculated from the known tree)
The Algorithm - gini index
S - data set
C - number of classes
P_i - relative frequency of class i in S
gini index: $gini(S) = 1 - \sum_{i=1}^{C} P_i^2$
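A minimal sketch of the gini index computation, assuming the standard definition gini(S) = 1 - (sum over the C classes of P_i squared). The function name and the toy labels are illustrative and do not come from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum of P_i squared."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen for this sketch: an empty set counts as pure
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["safe", "safe", "safe"]))            # a single class
print(gini(["safe", "safe", "risky", "risky"]))  # two equally frequent classes
```

A set containing a single class has index 0, and the index grows as the classes become more evenly mixed.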
The Algorithm
GrowTree(DETAIL TABLE)
  Initialize tree T and put all records of DETAIL in root
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate gini index for each non-STOP leaf at each split value with respect to attribute i
    for each non-STOP leaf do
      get the overall best split for it;
    partition the records and grow the tree for one more level according to best splits;
    mark all small or pure leaves as STOP nodes;
  return T;
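The level-by-level growth loop can be sketched in memory as follows. This is an illustration under stated assumptions, not MIND itself: records are Python dicts, splits have the form `attr <= value`, the split score is the size-weighted gini index of the two sides, and a leaf is treated as "small" below a hypothetical `min_size`. For brevity the sketch marks STOP leaves at the start of each pass rather than at the end.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(records, attrs):
    """Return (score, attr, value) for the best 'attr <= value' split, or None."""
    best = None
    for attr in attrs:
        for value in sorted({r[attr] for r in records}):
            low = [r["class"] for r in records if r[attr] <= value]
            high = [r["class"] for r in records if r[attr] > value]
            if not high:
                continue  # splitting at the maximum value separates nothing
            score = (len(low) * gini(low) + len(high) * gini(high)) / len(records)
            if best is None or score < best[0]:
                best = (score, attr, value)
    return best

def grow_tree(records, attrs, min_size=2):
    leaves = {0: records}   # leaf number -> records currently at that leaf
    splits = {}             # internal node -> (attr, value, low child, high child)
    stopped = set()         # leaves marked as STOP nodes
    next_id = 1
    while any(leaf not in stopped for leaf in leaves):
        # One pass over the current non-STOP leaves grows the tree by one level.
        for leaf in [l for l in leaves if l not in stopped]:
            recs = leaves[leaf]
            pure = len({r["class"] for r in recs}) <= 1
            split = None if pure or len(recs) < min_size else best_split(recs, attrs)
            if split is None:
                stopped.add(leaf)
                continue
            _, attr, value = split
            del leaves[leaf]
            leaves[next_id] = [r for r in recs if r[attr] <= value]
            leaves[next_id + 1] = [r for r in recs if r[attr] > value]
            splits[leaf] = (attr, value, next_id, next_id + 1)
            next_id += 2
    return splits, leaves

# Toy training set (hypothetical values).
data = [
    {"age": 25, "salary": 50, "class": "risky"},
    {"age": 28, "salary": 90, "class": "risky"},
    {"age": 40, "salary": 50, "class": "risky"},
    {"age": 45, "salary": 90, "class": "safe"},
    {"age": 50, "salary": 80, "class": "safe"},
]
splits, leaves = grow_tree(data, ["age", "salary"])
print(splits)
```

On this toy data a single split on `age <= 40` already separates the two classes, so both children are pure and the loop stops after one level.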
Database Implementation - Dimension table
• For each attribute and each level of the tree:
INSERT INTO DIMi
  SELECT leaf_num, class, attri, count(*)
  FROM DETAIL
  WHERE leaf_num <> STOP
  GROUP BY leaf_num, class, attri
Size of DIMi is at most #leaves * #distinct values of attri * #classes
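A small runnable illustration of the DIMi query using Python's built-in `sqlite3` module. The toy rows are hypothetical, and encoding a STOP leaf as `leaf_num = -1` is an assumption of this sketch (the slides only write `STOP`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DETAIL (attr1 INTEGER, class TEXT, leaf_num INTEGER)")
# Hypothetical training rows; the last one sits in a STOP leaf (encoded as -1 here).
conn.executemany(
    "INSERT INTO DETAIL VALUES (?, ?, ?)",
    [(25, "risky", 0), (25, "risky", 0), (40, "safe", 0), (40, "risky", 0), (50, "safe", -1)],
)
conn.execute("CREATE TABLE DIM1 (leaf_num INTEGER, class TEXT, attr1 INTEGER, count INTEGER)")
# One pass over DETAIL collapses the rows into per-(leaf, class, value) counts.
conn.execute("""
    INSERT INTO DIM1
    SELECT leaf_num, class, attr1, COUNT(*)
    FROM DETAIL
    WHERE leaf_num <> -1
    GROUP BY leaf_num, class, attr1
""")
dim1 = conn.execute("SELECT * FROM DIM1 ORDER BY leaf_num, class, attr1").fetchall()
print(dim1)
```

Only combinations that actually occur in DETAIL produce a row, which is why the product of leaves, distinct values, and classes is an upper bound on the size of DIM1.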
Database Implementation - Dimension table SQL
SELECT FROM DETAIL
  INSERT INTO DIM1 leaf_num, class, attr1, count(*)
    WHERE leaf_num <> STOP
    GROUP BY leaf_num, class, attr1
  INSERT INTO DIM2 leaf_num, class, attr2, count(*)
    WHERE leaf_num <> STOP
    GROUP BY leaf_num, class, attr2
  • • •
Database Implementation - UP/DOWN - split
For each attribute we find all possible split places:
INSERT INTO UP
  SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
  FROM DIMi d1 FULL OUTER JOIN DIMi d2
    ON d1.leaf_num = d2.leaf_num AND
       d2.attri <= d1.attri AND d1.class = d2.class
  GROUP BY d1.leaf_num, d1.attri, d1.class
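The self-join that builds UP can be tried on a toy DIM1 table. This sketch swaps the slide's FULL OUTER JOIN for a plain inner JOIN, because older SQLite builds lack FULL OUTER JOIN; the table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DIM1 (leaf_num INTEGER, class TEXT, attr1 INTEGER, count INTEGER)")
conn.executemany(
    "INSERT INTO DIM1 VALUES (?, ?, ?, ?)",
    [(0, "risky", 25, 2), (0, "risky", 40, 1), (0, "safe", 40, 1), (0, "safe", 50, 3)],
)
# For every (leaf, class, value), sum the counts of all values <= that value:
# the number of records of that class that fall on the "<=" side of the split.
up = conn.execute("""
    SELECT d1.leaf_num, d1.attr1, d1.class, SUM(d2.count)
    FROM DIM1 d1 JOIN DIM1 d2
      ON d1.leaf_num = d2.leaf_num
     AND d2.attr1 <= d1.attr1
     AND d1.class = d2.class
    GROUP BY d1.leaf_num, d1.attr1, d1.class
    ORDER BY d1.leaf_num, d1.class, d1.attr1
""").fetchall()
print(up)
```

Each output row is a running total per class. Combinations absent from DIM1, such as class `safe` at value 25, produce no row in this sketch, so a consumer would have to treat a missing row as a count of zero.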
Database Implementation - Class View
Create a view for each class k and attribute i:
CREATE VIEW Ck_UP(leaf_num, attri, count) AS
  SELECT leaf_num, attri, count
  FROM UP
  WHERE class = k
Database Implementation - GINI VALUE
Create a view for all gini values:
CREATE VIEW GINI_VALUE(leaf_num, attri, gini) AS
  SELECT u1.leaf_num, u1.attri, ƒgini
  FROM C1_UP u1, .., Cc_UP uc, C1_DOWN d1, ..., Cc_DOWN dc
  WHERE u1.attri = .. = uc.attri = .. = dc.attri
    AND u1.leaf_num = .. = uc.leaf_num = .. = dc.leaf_num
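The slide does not spell out ƒgini. One plausible reading, sketched here as an assumption, is the size-weighted gini index of the two sides of a split, computed from the per-class counts that the Ck_UP and Ck_DOWN views expose for one leaf and one split value. The name `f_gini` and the sample counts are illustrative.

```python
def f_gini(up_counts, down_counts):
    """Weighted gini index of a split, from per-class counts on each side."""
    def side(counts):
        n = sum(counts)
        if n == 0:
            return 0, 0.0  # an empty side contributes nothing
        return n, 1.0 - sum((c / n) ** 2 for c in counts)

    n_up, g_up = side(up_counts)
    n_down, g_down = side(down_counts)
    total = n_up + n_down
    if total == 0:
        raise ValueError("split has no records on either side")
    return (n_up * g_up + n_down * g_down) / total

# Two classes: 2 and 0 records on the "<=" side, 1 and 4 on the ">" side.
print(f_gini([2, 0], [1, 4]))
```

Because the function needs only 2 * C counts per candidate split, the gini value can be computed from the aggregated tables without touching the individual training records.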
Database Implementation - MIN GINI VALUE
Create a table of minimum gini values for attribute i:
INSERT INTO MIN_GINI
  SELECT leaf_num, i, attri, gini
  FROM GINI_VALUE a
  WHERE a.gini =
    (SELECT MIN(gini) FROM GINI_VALUE b WHERE a.leaf_num = b.leaf_num)
Database Implementation - BEST SPLIT
Create a view over MIN_GINI for the best split:
CREATE VIEW BEST_SPLIT(leaf_num, attr_name, attr_value) AS
  SELECT leaf_num, attr_name, attr_value
  FROM MIN_GINI a
  WHERE a.gini =
    (SELECT MIN(gini) FROM MIN_GINI b WHERE a.leaf_num = b.leaf_num)
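The per-leaf minimum via a correlated subquery can be checked on a toy MIN_GINI table with `sqlite3`. The rows and the attribute names `age` and `salary` are hypothetical, and the view definition includes the `AS` keyword and closing parenthesis that standard SQL requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MIN_GINI (leaf_num INTEGER, attr_name TEXT, attr_value INTEGER, gini REAL)"
)
# One candidate row per (leaf, attribute): that attribute's best split in that leaf.
conn.executemany(
    "INSERT INTO MIN_GINI VALUES (?, ?, ?, ?)",
    [(0, "age", 30, 0.25), (0, "salary", 62, 0.40),
     (1, "age", 45, 0.30), (1, "salary", 80, 0.10)],
)
conn.execute("""
    CREATE VIEW BEST_SPLIT (leaf_num, attr_name, attr_value) AS
    SELECT leaf_num, attr_name, attr_value
    FROM MIN_GINI a
    WHERE a.gini = (SELECT MIN(gini) FROM MIN_GINI b WHERE a.leaf_num = b.leaf_num)
""")
best = conn.execute("SELECT * FROM BEST_SPLIT ORDER BY leaf_num").fetchall()
print(best)
```

If two attributes tie for the minimum in the same leaf, this query returns both rows, so a tie-breaking rule would be needed before partitioning.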
Database Implementation - Partitioning
Build new nodes by splitting old nodes according to BEST_SPLIT values
Assign the correct node to each record: updating leaf_num is done by a function
No need to UPDATE data or DB
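One way to picture "done by a function" is a user-defined function that derives a record's leaf number from the current tree at query time, so that DETAIL itself is never updated. The sketch below registers such a function with `sqlite3`; the tree shape, the column names, and the rows are all hypothetical.

```python
import sqlite3

# Hypothetical tree: internal node -> (attribute, split value, child if <=, child if >).
TREE = {0: ("age", 30, 1, 2), 2: ("salary", 62, 3, 4)}

def leaf_num(age, salary):
    """Walk the tree from the root and return the leaf this record falls into."""
    record = {"age": age, "salary": salary}
    node = 0
    while node in TREE:  # nodes missing from TREE are leaves
        attr, value, low, high = TREE[node]
        node = low if record[attr] <= value else high
    return node

conn = sqlite3.connect(":memory:")
conn.create_function("leaf_num", 2, leaf_num)
conn.execute("CREATE TABLE DETAIL (age INTEGER, salary INTEGER, class TEXT)")
conn.executemany(
    "INSERT INTO DETAIL VALUES (?, ?, ?)",
    [(25, 50, "risky"), (40, 50, "risky"), (40, 90, "safe"), (55, 70, "safe")],
)
# The leaf number is computed on the fly; no UPDATE is issued against DETAIL.
counts = conn.execute("""
    SELECT leaf_num(age, salary) AS leaf, class, COUNT(*)
    FROM DETAIL
    GROUP BY leaf, class
    ORDER BY leaf, class
""").fetchall()
print(counts)
```

Growing the tree by one level then only means changing the in-memory `TREE` structure; the next counting query picks up the new leaf assignments automatically.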
Performance
I/O cost of MIND:
I/O cost of SPRINT:
Experimental Results
[Figure: Normalized time to finish building the tree]
[Figure: Normalized time to build the tree per example]
Experimental Results
[Figure: Normalized time to build the tree per # of processors]
[Figure: Time to build tree by training set size]
Conclusions
• MIND works over DB
• MIND works well because
– MIND rephrases the classification problem as a DB problem
– MIND avoids UPDATEs to the DETAIL table
– Parallelism and scaling are achieved by the use of RDBMS
– MIND uses a user function to get the performance gain in the DIMi creation.