Download - PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce
![Page 1: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/1.jpg)
PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduceDate : 2013/10/16Source : CIKM’12Authors : Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, and Eli UpfaAdvisor : Dr.Jia-ling, KohSpeaker : Wei, Chang
1
![Page 2: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/2.jpg)
Outline• Introduction• Approach• Experiment• Conclusion
2
![Page 3: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/3.jpg)
Association Rules
3
• Market-Basket transactions
![Page 4: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/4.jpg)
Itemset
4
• I={Bread, Milk, Diaper, Beer, Eggs, Coke}• Itemsets
• 1-itemsets: {Beer}, {Milk}, {Bread}, …• 2-itemsets: {Bread, Milk}, {Bread, Beer}, …• 3-itemsets: {Milk, Eggs, Coke}, {Bread, Milk, Diaper},…
• t1 contains {Bread, Milk}, but doesn’t contain {Bread, Beer}
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
![Page 5: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/5.jpg)
Frequent Itemset
• Support count : (X)• Frequency of occurrence of an itemset X• (X) = |{ti | X ti , ti T}|• E.g. ({Milk, Bread, Diaper}) = 2
• Support• Fraction of transactions that contain an itemset X• s(X) = (X) /|T|• E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset• An itemset X s(X) minsup
5
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
![Page 6: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/6.jpg)
Association Rule
6
Example:
Beer}{}Diaper,Milk{
Association Rule– X Y, where X and Y are
itemsets– Example:
{Milk, Diaper} {Beer}
Rule Evaluation Metrics– Support
Fraction of transactions that contain both X and Y
s(XY)= (XY)/|T|
– Confidence How often items in Y appear in
the transactions that contain X c(XY)= (XY)/ (X)
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
4.052
|T|)BeerDiaper,,Milk(
s
67.032
)Diaper,Milk()BeerDiaper,Milk,(
c
![Page 7: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/7.jpg)
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules having • support ≥ minsup• confidence ≥ minconf
7
![Page 8: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/8.jpg)
Goal• Because :
• Number of transactions • Cost of the existing algorithm, e.g. Apriori, FP-Tree• What can we do in big data ?
• Sampling• Parallel
• Goal : • A MapReduce algorithm for discovering approximate collections
of frequent itemsets or association rules
8
![Page 9: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/9.jpg)
Outline• Introduction• Approach• Experiment• Conclusion
9
![Page 10: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/10.jpg)
Sampling
10
Original Data Sampling Find FI in
Sample
Question : Is the sample always good ?
![Page 11: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/11.jpg)
Definition
11
𝜃
𝜀1𝑓 𝐷(𝐴)
𝜀2
𝑓 𝐴
![Page 12: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/12.jpg)
How many samples do we need?
12
![Page 13: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/13.jpg)
Introduction of MapReduce
13
![Page 14: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/14.jpg)
Concept
14
Total Data
Sampling locally
Sampling locally
Sampling locally
Sampling locally
Sampling locally
Sampling locally
Mininglocally
Mininglocally
Mininglocally
Mininglocally
Mininglocally
Mininglocally
Global Frequent Itemset
![Page 15: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/15.jpg)
PARMA
15
![Page 16: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/16.jpg)
Trade-offs
16
Number of samples
Probability to get the wrong
approximation
![Page 17: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/17.jpg)
In Reduce 2• For each itemset, we have
• Then we use
17
¿
![Page 18: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/18.jpg)
Result• The itemset A is declared globally frequent and will be present in the output if and only if • Let be the shortest interval such that there are at least
N-R+1 elements from that belong to this interval.
18
![Page 19: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/19.jpg)
Association Rules
19
![Page 20: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/20.jpg)
Outline• Introduction• Approach• Experiment• Conclusion
20
![Page 21: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/21.jpg)
Implementation• Amazon Web Service : ml.xlarge• Hadoop with 8 nodes• Parameters :
21
![Page 22: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/22.jpg)
Compare with other Algorithm
22
![Page 23: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/23.jpg)
Runtime in Each Step
23
![Page 24: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/24.jpg)
Acceptable False Positives
24
![Page 25: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/25.jpg)
Error in frequency estimations
25
![Page 26: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/26.jpg)
Outline• Introduction• Approach• Experiment• Conclusion
26
![Page 27: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce](https://reader035.vdocuments.mx/reader035/viewer/2022062520/56816356550346895dd40111/html5/thumbnails/27.jpg)
Conclusion• A parallel algorithm for mining quasi-optimal collections of
frequent itemsets and association rules in MapReduce.• 30-55% runtime improvement over PFP.• Verify the accuracy of the theoretical bounds, as well as show
that in practice our results are orders of magnitude more accurate than is analytically guaranteed.
27