Apriori data mining in the cloud
![Page 1: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/1.jpg)
APRIORI DATA MINING IN THE CLOUD
Case study: d60 Raptor smartAdvisor
Jan Neerbek, Alexandra Institute
![Page 2: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/2.jpg)
2
Agenda
· d60: A cloud/data mining case
· Cloud
· Data Mining
· Market Basket Analysis
· Large data sets
· Our solution
![Page 3: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/3.jpg)
3
Alexandra Institute
The Alexandra Institute is a non-profit company that works with application-oriented IT research.
Our focus is pervasive computing, and we unlock the business potential of our members and customers through research-based, user-driven innovation.
![Page 4: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/4.jpg)
4
The case: d60
· Danish company
· A similar-products recommendation engine
· d60 was outgrowing its servers (late 2010)
· They saw potential in moving to Azure
![Page 5: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/5.jpg)
5
The setup
Internet webshops
Log shopping patterns
Do data mining
Product recommendations
![Page 6: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/6.jpg)
6
The cloud potential
· Elasticity
· No upfront server cost
· Cheaper licenses
· Faster calculations
![Page 7: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/7.jpg)
7
Challenges
· No SQL Server Analysis Services (SSAS)
· Small compute nodes
· Partitioned database (50 GB)
· SQL Server ingress/egress access is slow
![Page 8: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/8.jpg)
8
The cloud
[Figure: the cloud drawn as a collection of compute nodes]
![Page 9: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/9.jpg)
9
The cloud and services
[Figure: the compute nodes together with a data layer service and a messaging service]
![Page 10: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/10.jpg)
10
Data layer service
· Application specific (schema/layout)
· SQL, table or other
· Easily becomes a bottleneck
· Can be difficult to scale
![Page 11: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/11.jpg)
11
Messaging service: Task Queues
· Standard data structure
· Built-in ordering (FIFO)
· Can be scaled
· Good for asynchronous messages
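The task-queue idea on this slide can be illustrated with a small, in-process sketch. It uses Python's standard `queue.Queue` as a stand-in for a cloud messaging service; the function name `run_workers` and the use of threads are assumptions made for illustration and are not taken from the talk.

```python
import queue
import threading


def run_workers(tasks, handler, worker_count=2):
    """Push tasks onto a FIFO queue and let worker threads drain it."""
    task_queue = queue.Queue()          # FIFO ordering is built in
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = task_queue.get()
            if task is None:            # sentinel: no more work for this worker
                task_queue.task_done()
                return
            outcome = handler(task)
            with lock:                  # results list is shared between threads
                results.append(outcome)
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for task in tasks:                  # producers only talk to the queue
        task_queue.put(task)
    for _ in threads:                   # one sentinel per worker
        task_queue.put(None)
    task_queue.join()
    for thread in threads:
        thread.join()
    return results


squares = run_workers([1, 2, 3, 4], lambda n: n * n)
print(sorted(squares))
```

Workers pick up tasks asynchronously, so the order in which results arrive is not guaranteed even though the queue itself is FIFO; the example therefore sorts before printing.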
![Page 12: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/12.jpg)
12
DATA MINING
![Page 13: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/13.jpg)
13
Data mining
Data mining is the use of automated data analysis techniques to uncover relationships among data items
Market basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals
[about.com/wikipedia.org]
![Page 14: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/14.jpg)
14
Market basket analysis
· Customer1: Avocado, Milk, Butter, Potatoes
· Customer2: Milk, Diapers, Avocado, Beer
· Customer3: Beef, Lemons, Beer, Chips
· Customer4: Cereal, Beer, Beef, Diapers
![Page 15: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/15.jpg)
15
Market basket analysis
· The itemset (Diapers, Beer) occurs in 50% of the baskets (Customer2 and Customer4)
· Frequency threshold parameter
· Find as many frequent itemsets as possible
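The 50% figure can be checked with a few lines of Python. This is a minimal sketch over the four baskets shown on the slide; the helper names `support` and `is_frequent` are illustrative and do not come from the talk.

```python
BASKETS = {
    "Customer1": {"Avocado", "Milk", "Butter", "Potatoes"},
    "Customer2": {"Milk", "Diapers", "Avocado", "Beer"},
    "Customer3": {"Beef", "Lemons", "Beer", "Chips"},
    "Customer4": {"Cereal", "Beer", "Beef", "Diapers"},
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for items in baskets.values() if wanted <= items)
    return hits / len(baskets)


def is_frequent(itemset, baskets, threshold):
    """An itemset is frequent when its support reaches the threshold."""
    return support(itemset, baskets) >= threshold


print(support({"Diapers", "Beer"}, BASKETS))                   # 0.5
print(is_frequent({"Diapers", "Beer"}, BASKETS, threshold=0.5))  # True
print(is_frequent({"Butter"}, BASKETS, threshold=0.5))           # False
```

Lowering the threshold admits more itemsets as frequent, which is why it is the main tuning parameter of the analysis.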
![Page 16: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/16.jpg)
16
Market basket analysis
· Popular, effective algorithm: FP-growth
· Based on the FP-tree data structure
· Requires all data in near-memory
· Most research in distributed models has been for cluster setups
![Page 17: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/17.jpg)
17
Building the FP-tree (extends the prefix-tree structure)
· Transaction: Customer1 (Avocado, Milk, Butter, Potatoes)
· Tree: Avocado → Butter → Milk → Potatoes
![Page 18: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/18.jpg)
18
Building the FP-tree
· Next transaction: Customer2 (Milk, Diapers, Avocado, Beer)
· Tree so far: Avocado → Butter → Milk → Potatoes
![Page 19: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/19.jpg)
19
Building the FP-tree
· Transaction: Customer2 (Milk, Diapers, Avocado, Beer)
· Tree: Avocado → Butter → Milk → Potatoes, plus a new branch Avocado → Beer → Diapers → Milk
![Page 21: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/21.jpg)
21
Building the FP-tree
Tree after all four transactions:
· Avocado → Butter → Milk → Potatoes
· Avocado → Beer → Diapers → Milk
· Beef → Beer → Chips → Lemon
· Beef → Beer → Cereal → Diapers
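The construction shown on these slides can be sketched in Python as a counted prefix tree. This is an illustration under assumptions: items are inserted in alphabetical order, which is what the paths on the slides suggest (textbook FP-trees usually order items by descending frequency and also keep header links, both omitted here), and the names `FPNode`, `insert` and `paths` are made up for this sketch.

```python
class FPNode:
    """One node of a counted prefix tree."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0        # number of transactions passing through this node
        self.children = {}    # item -> FPNode


def insert(root, transaction):
    """Insert one transaction, sharing existing prefixes where possible."""
    node = root
    for item in sorted(transaction):   # assumption: fixed alphabetical order
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1


def paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of items."""
    if not node.children:
        if prefix:
            yield prefix
        return
    for item in sorted(node.children):
        yield from paths(node.children[item], prefix + (item,))


root = FPNode()
for basket in (
    ["Avocado", "Milk", "Butter", "Potatoes"],   # Customer1
    ["Milk", "Diapers", "Avocado", "Beer"],      # Customer2
    ["Beef", "Lemons", "Beer", "Chips"],         # Customer3
    ["Cereal", "Beer", "Beef", "Diapers"],       # Customer4
):
    insert(root, basket)

for path in paths(root):
    print(" -> ".join(path))
```

The two baskets starting with Avocado share one Avocado node with count 2, and the two starting with Beef share the Beef → Beer prefix; this sharing is what keeps the tree smaller than the raw data.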
![Page 22: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/22.jpg)
22
FP-growth
Grows the frequent itemsets, recursively

    FP-growth(FP-tree tree) {
      …
      for-each (item in tree) {
        count = CountOccur(tree, item);
        if (IsFrequent(count)) {
          OutputSet(item);
          sub = tree.GetTree(tree, item);
          FP-growth(sub);
        }
      }
    }
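The recursion in the pseudocode can be made concrete with a short Python sketch. To stay small it works on projected transaction lists instead of an actual FP-tree, so it shows the same divide-and-conquer pattern rather than the FP-tree implementation itself. The function name `pattern_growth`, the alphabetical item order and the absolute `min_count` threshold are assumptions for illustration.

```python
from collections import Counter


def pattern_growth(transactions, min_count, postfix=()):
    """Yield (itemset, count) for every frequent itemset, depth first."""
    counts = Counter(item for t in transactions for item in set(t))
    for item in sorted(counts):
        if counts[item] < min_count:       # plays the role of IsFrequent(count)
            continue
        itemset = (item,) + postfix
        yield itemset, counts[item]        # plays the role of OutputSet(item)
        # Sub-problem (GetTree in the pseudocode): keep the transactions that
        # contain `item`, restricted to items ordered before it, so that each
        # itemset is generated exactly once.
        projected = [
            [other for other in t if other < item]
            for t in transactions
            if item in t
        ]
        yield from pattern_growth(projected, min_count, itemset)


BASKETS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]

frequent = dict(pattern_growth(BASKETS, min_count=2))
for itemset, count in sorted(frequent.items()):
    print(itemset, count)
```

With a threshold of two baskets, the run reports five single items and three pairs, including (Beer, Diapers).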
![Page 23: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/23.jpg)
23
FP-growth algorithm: Divide and Conquer
Traverse tree
[Figure: the FP-tree built from the four baskets]
![Page 24: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/24.jpg)
24
FP-growth algorithm: Divide and Conquer
Generate sub-trees
[Figure: the FP-tree built from the four baskets]
![Page 25: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/25.jpg)
25
FP-growth algorithm: Divide and Conquer
Call recursively
[Figure: the FP-tree next to a generated sub-tree with the paths Avocado → Butter and Avocado → Beer → Diapers]
![Page 26: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/26.jpg)
26
FP-growth algorithm: Memory usage
The FP-tree does not fit in local memory; what to do?
· Emulate Distributed Shared Memory
![Page 27: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/27.jpg)
27
Distributed Shared Memory?
· To add nodes is to add memory
· Works best in tightly coupled setups, with low-latency, high-speed networks
[Figure: five nodes, each with its own CPU and memory, connected by a network and presented as one shared memory]
![Page 28: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/28.jpg)
28
FP-growth algorithm: Memory usage
The FP-tree does not fit in local memory; what to do?
· Emulate Distributed Shared Memory
· Optimize your data structures
· Buy more RAM
· Get a good idea
![Page 29: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/29.jpg)
29
Get a good idea
· Database scans are serial and can be distributed
· The list of items used in the recursive calls uniquely determines what part of data we are looking at
![Page 30: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/30.jpg)
30
Get a good idea
[Figure: the FP-tree next to a generated sub-tree with the paths Avocado → Butter and Avocado → Beer → Diapers]
![Page 31: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/31.jpg)
31
Get a good idea
· Postfix (Milk): sub-tree Avocado → Butter and Avocado → Beer → Diapers
· Postfix (Butter, Milk): sub-tree Avocado
· Postfix (Diapers, Milk): sub-tree Avocado → Beer
These are postfix paths
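The point that a postfix uniquely determines the part of the data being examined can be illustrated as follows. The sketch rebuilds the conditional data for any postfix directly from the raw baskets, so only the short item list would need to travel between nodes. The function name `project` and the alphabetical item order are assumptions for illustration, not details confirmed by the talk.

```python
BASKETS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],   # Customer1
    ["Milk", "Diapers", "Avocado", "Beer"],      # Customer2
    ["Beef", "Lemons", "Beer", "Chips"],         # Customer3
    ["Cereal", "Beer", "Beef", "Diapers"],       # Customer4
]


def project(transactions, postfix):
    """Rebuild the conditional data for `postfix` from raw transactions.

    Keeps the transactions that contain every postfix item and restricts
    them to items ordered before the smallest postfix item.
    """
    required = set(postfix)
    first = min(postfix) if postfix else None
    return [
        sorted(item for item in t if first is None or item < first)
        for t in transactions
        if required <= set(t)
    ]


print(project(BASKETS, ("Milk",)))
print(project(BASKETS, ("Butter", "Milk")))
print(project(BASKETS, ("Diapers", "Milk")))
```

Because the result depends only on the raw data and the postfix, any node holding the data can recompute it without receiving a sub-tree from its caller.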
![Page 32: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/32.jpg)
32
BRINGING IT ALL TOGETHER
![Page 33: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/33.jpg)
33
Buckets
· Use postfix paths for messaging
· Working with buckets
[Figure: buckets drawn over the data set, with axes labelled Transactions and Items]
![Page 34: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/34.jpg)
34
FP-growth revisited
    FP-growth(FP-tree tree) {
      …
      for-each (item in tree) {
        count = CountOccur(tree, item);
        if (IsFrequent(count)) {
          OutputSet(item);
          sub = tree.GetTree(tree, item);
          FP-growth(sub);
        }
      }
    }

Annotations on the code:
· Replaced with postfix
· Done in parallel (marked three times)
![Page 35: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/35.jpg)
35
Communication
[Figure: four nodes communicating through the data layer]
![Page 36: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/36.jpg)
36
Revised Communication
[Figure: four nodes communicating through the data layer and a message queue (MQ)]
![Page 37: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/37.jpg)
37
Running FP-growth
Distribute buckets
Count items (with postfix size = n)
Collect counts (per postfix)
Call recursive
Standard FP-growth
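The distribute, count and collect steps can be sketched in Python for a single postfix. The round-robin split into buckets, the alphabetical item order and all function names are assumptions for illustration; the talk does not specify how buckets are formed.

```python
from collections import Counter


def make_buckets(transactions, bucket_count):
    """Round-robin split; each bucket would live on a different node."""
    return [transactions[i::bucket_count] for i in range(bucket_count)]


def count_in_bucket(bucket, postfix):
    """Local step: count the items that can extend `postfix` in one bucket."""
    required = set(postfix)
    first = min(postfix) if postfix else None
    counts = Counter()
    for t in bucket:
        if required <= set(t):
            counts.update(i for i in set(t) if first is None or i < first)
    return counts


def collect(per_bucket_counts):
    """Merge step: add up the partial counts reported for one postfix."""
    total = Counter()
    for partial in per_bucket_counts:
        total.update(partial)
    return total


BASKETS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]

buckets = make_buckets(BASKETS, bucket_count=2)
partials = [count_in_bucket(bucket, ("Beer",)) for bucket in buckets]
total = collect(partials)
print(total)
```

The merged counts equal what a single scan over all baskets would produce, which is why the counting step can be spread over nodes.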
![Page 39: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/39.jpg)
39
Collecting what we have learned
· Message-driven work, using a message queue
· Peer-to-peer for intermediate results
· Distribute data for scalability (buckets)
· Small messages (list of items)
· Together, these allow us to distribute FP-growth
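Put together, the points above suggest a loop like the following single-process simulation. It is a sketch under assumptions, not the implementation from the talk: a `deque` stands in for the message queue, each message is just a postfix (a small tuple of items), buckets are plain lists, and items are ordered alphabetically.

```python
from collections import Counter, deque


def message_driven_growth(buckets, min_count):
    """Find frequent itemsets by processing postfix messages from a queue."""
    messages = deque([()])        # start with the empty postfix
    frequent = {}
    while messages:
        postfix = messages.popleft()
        required = set(postfix)
        first = min(postfix) if postfix else None
        total = Counter()
        for bucket in buckets:    # in a deployment, one node per bucket
            for t in bucket:
                if required <= set(t):
                    total.update(
                        i for i in set(t) if first is None or i < first
                    )
        for item, count in total.items():
            if count >= min_count:
                itemset = (item,) + postfix
                frequent[itemset] = count
                messages.append(itemset)   # the recursive call becomes a message
    return frequent


BUCKETS = [
    [["Avocado", "Milk", "Butter", "Potatoes"], ["Beef", "Lemons", "Beer", "Chips"]],
    [["Milk", "Diapers", "Avocado", "Beer"], ["Cereal", "Beer", "Beef", "Diapers"]],
]

result = message_driven_growth(BUCKETS, min_count=2)
for itemset in sorted(result):
    print(itemset, result[itemset])
```

Since every message carries only a list of items, any idle worker can pick it up, and a message whose worker fails can simply be processed again by another one.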
![Page 40: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/40.jpg)
40
Advantages
· Configurable work sizes
· Good distribution of work
· Robust against computer failure
· Fast!
![Page 41: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/41.jpg)
41
So what about performance?
[Figure: chart titled "Total node time", comparing message-driven FP-growth with FP-growth at x-axis values 1, 2, 4 and 8; time axis (hh:mm:ss) from 00:00:00 to 04:29:59]
![Page 42: Apriori data mining in the cloud](https://reader033.vdocuments.mx/reader033/viewer/2022061116/54657218af795971108b4a96/html5/thumbnails/42.jpg)
42
Thank you!
Questions?