![Page 1: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/1.jpg)
Data Mining Algorithms for Large-Scale Distributed Systems
Presenter: Ran Wolff
Joint work with Assaf Schuster
2003
![Page 2: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/2.jpg)
What is Data Mining?
The automatic analysis of large databases
The discovery of previously unknown patterns
The generation of a model of the data
![Page 3: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/3.jpg)
Main Data Mining Problems
Association rules (description): he who does this and that will usually do some other thing too
Classification (fraud, churn): these attributes indicate good behavior, those indicate bad behavior
Clustering (analysis): there are three types of entities
![Page 4: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/4.jpg)
Examples – Classification
Customers purchase artifacts in a store
Each transaction is described in terms of a vector of features
The owner of the store tries to predict which transactions are fraudulent
Example: young men who buy small electronics during rush hours
Solution: do not accept checks
![Page 5: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/5.jpg)
Examples – Associations
Amazon tracks user queries
Suggests to each user additional books he would usually be interested in
A supermarket finds out that “people who buy diapers also buy beer”
Place diapers and beer at opposite sides of the supermarket
![Page 6: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/6.jpg)
Examples – Clustering
Resource location: find the best location for k distribution centers
Feature selection: find 1000 concepts which summarize a whole dictionary
Extract the meaning out of a document by replacing each word with the appropriate concept (car for auto, etc.)
![Page 7: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/7.jpg)
Why Mine Data of LSD Systems?
Data mining is good
It is otherwise difficult to monitor an LSD system: lots of data, spread across the system, impossible to collect
Many interesting phenomena are inherently distributed (e.g., DDoS); it is not enough to just monitor a few nodes
![Page 8: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/8.jpg)
An Example
Peers in the Kazaa network reveal to the system which files they have on their disks in exchange for access to the files of their peers
The result is a 2M-peer database of people's recreational preferences
Mining it, you could discover that Matrix fans are also keen on Radiohead (RH) songs
Promote RH performances in Matrix-Reloaded
Ask RH to write the music for Matrix-IV
![Page 9: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/9.jpg)
What is so special about this problem?
Huge systems – huge amounts of data
Dynamic setting
System – join / depart
Data – constant update
Ad-hoc solution
Fast convergence
![Page 10: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/10.jpg)
Our Work
We developed an association rule mining algorithm that works well in LSD systems
Local and therefore scalable
Asynchronous and therefore fast
Dynamic and therefore robust
Accurate – not approximated
Anytime – you get early results fast
![Page 11: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/11.jpg)
In a Teaspoon
A distributed data mining algorithm can be described as a series of distributed decisions
Those decisions are reduced to a majority vote
We developed a majority voting protocol which has all those good qualities
The outcome is an LSD association rule mining algorithm (still to come: classification)
![Page 12: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/12.jpg)
Problem Definition – Association Rule Mining (ARM)
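$$I = \{i_1, i_2, \ldots, i_m\}$$

$$X \subseteq I, \qquad T \subseteq I$$

$$DB = \{T_1, T_2, \ldots, T_k\}$$

$$Support(X, DB) = \left|\{T \in DB : X \subseteq T\}\right|$$

$$Freq(X, DB) = \frac{Support(X, DB)}{|DB|}$$

$$Conf(X \Rightarrow Y, DB) = \frac{Freq(X \cup Y, DB)}{Freq(X, DB)}$$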
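The definitions of support, frequency, and confidence on this slide can be sketched in Python. This is a minimal illustration and not the authors' code; the function names and the toy database are assumptions made for the example.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]


def support(x: Itemset, db: List[Itemset]) -> int:
    """Number of transactions in db that contain every item of x."""
    return sum(1 for t in db if x <= t)


def freq(x: Itemset, db: List[Itemset]) -> float:
    """Support of x divided by the database size."""
    return support(x, db) / len(db)


def conf(x: Itemset, y: Itemset, db: List[Itemset]) -> float:
    """Confidence of the rule x => y (assumes x occurs at least once)."""
    return freq(x | y, db) / freq(x, db)


# Hypothetical toy database of four transactions.
db = [frozenset(t) for t in (
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
)]
diapers = frozenset({"diapers"})
beer = frozenset({"beer"})

print(support(diapers, db), freq(diapers | beer, db), conf(diapers, beer, db))
```

In this toy database three of four transactions contain diapers and two contain both items, so the rule diapers ⇒ beer has frequency 1/2 and confidence 2/3.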
![Page 13: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/13.jpg)
Solution to Traditional ARM
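Let $0 \le MinFreq \le 1$, $0 \le MinConf \le 1$

$$R[DB] = \{X \Rightarrow Y : X \cap Y = \emptyset,\; Freq(X \cup Y, DB) \ge MinFreq,\; Conf(X \Rightarrow Y, DB) \ge MinConf\}$$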
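To make the rule set concrete, here is a brute-force sketch that enumerates every rule meeting both thresholds on a small in-memory database. It is only an illustration of the definition, not an efficient miner and not the algorithm presented in this talk; the name `mine_rules`, the requirement that both sides of a rule be non-empty, and the toy data are assumptions.

```python
from itertools import combinations


def mine_rules(db, min_freq, min_conf):
    """Return all (X, Y) with X, Y disjoint and non-empty such that
    Freq(X u Y) >= min_freq and Conf(X => Y) >= min_conf."""
    n = len(db)
    items = sorted(set().union(*db))

    def freq(x):
        return sum(1 for t in db if x <= t) / n

    found = set()
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            z = frozenset(combo)
            fz = freq(z)
            # Skip itemsets that never occur (confidence undefined) or are infrequent.
            if fz == 0 or fz < min_freq:
                continue
            # Try every split of z into a left-hand side x and right-hand side z - x.
            for k in range(1, size):
                for lhs in combinations(sorted(z), k):
                    x = frozenset(lhs)
                    if fz / freq(x) >= min_conf:
                        found.add((x, z - x))
    return found


db = [frozenset(t) for t in (
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
)]
rules = mine_rules(db, min_freq=0.5, min_conf=0.6)
for x, y in sorted(rules, key=lambda r: sorted(r[0])):
    print(sorted(x), "=>", sorted(y))
```

The enumeration is exponential in the number of items, which is one reason practical miners generate candidates level by level instead.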
![Page 14: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/14.jpg)
Large-Scale Distributed ARM
$$[u]_t = \{v \in V : v \text{ is reachable from } u \text{ at time } t\}$$

$$DB_t^{[u]} = \bigcup_{v \in [u]_t} DB_t^v$$

$$R_t[u] = R\left[DB_t^{[u]}\right]$$
![Page 15: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/15.jpg)
Solution of LSD-ARM
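No termination
Anytime solution

$\tilde{R}_t[u]$: the ad-hoc set of rules $X \Rightarrow Y$ held by node $u$ at time $t$

Recall:

$$\frac{\left|R_t[u] \cap \tilde{R}_t[u]\right|}{\left|R_t[u]\right|}$$

Precision:

$$\frac{\left|R_t[u] \cap \tilde{R}_t[u]\right|}{\left|\tilde{R}_t[u]\right|}$$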
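Recall and precision of a node's ad-hoc rule set against the true rule set can be computed as below. This is a small sketch; representing a rule as a pair of strings and returning 1.0 when a denominator set is empty are assumptions of this example, not something the slides specify.

```python
def recall(true_rules: set, adhoc_rules: set) -> float:
    """Fraction of the true rules that the node has already found."""
    if not true_rules:
        return 1.0  # assumed convention for an empty true set
    return len(true_rules & adhoc_rules) / len(true_rules)


def precision(true_rules: set, adhoc_rules: set) -> float:
    """Fraction of the node's current rules that are true rules."""
    if not adhoc_rules:
        return 1.0  # assumed convention for an empty ad-hoc set
    return len(true_rules & adhoc_rules) / len(adhoc_rules)


# Hypothetical rule sets; each rule is written as (left side, right side).
true_rules = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
adhoc_rules = {("a", "b"), ("b", "c"), ("d", "a")}

print(recall(true_rules, adhoc_rules), precision(true_rules, adhoc_rules))
```

Here the node has found two of the four true rules, and two of its three current rules are true.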
![Page 16: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/16.jpg)
Majority Vote in LSD Systems
Unknown number of nodes vote 0 or 1
Nodes may dynamically change their vote
Edges are dynamically added / removed
An infrastructure detects failures, ensures message integrity, and maintains a communication forest
Each node should decide if the global majority is of 0 or 1
![Page 17: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/17.jpg)
Majority Vote in LSD Systems – cont.
Because of the dynamic setting, the algorithm never terminates
Instead we measure the percentage of correct outputs
In static periods that percentage ought to converge to 100%
In stationary periods we will show it converges to a different percentage
Assume the overall percentage of ones remains the same, but they are constantly switched
![Page 18: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/18.jpg)
LSD-Majority Algorithm
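Nodes communicate by exchanging messages $\langle s, c \rangle$
Node $u$ maintains:
$s^u$ – its vote, $c^u$ – one (for now)
$\langle s^{uv}, c^{uv} \rangle$ – the last $\langle s, c \rangle$ it sent to $v$
$\langle s^{vu}, c^{vu} \rangle$ – the last $\langle s, c \rangle$ it received from $v$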
![Page 19: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/19.jpg)
LSD-Majority – cont.
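Node $u$ calculates:

$$\Delta^u = \Big(s^u + \sum_{v \in E^u} s^{vu}\Big) - \lambda \Big(c^u + \sum_{v \in E^u} c^{vu}\Big)$$

Captures the current knowledge of $u$

$$\Delta^{uv} = \left(s^{uv} + s^{vu}\right) - \lambda \left(c^{uv} + c^{vu}\right)$$

Captures the current agreement between $u$ and $v$

Here $\lambda$ is the required majority ratio ($\lambda = 1/2$ for a simple majority) and $E^u$ is the set of neighbors of $u$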
![Page 20: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/20.jpg)
LSD-Majority – Rationale
It is OK if the current knowledge of u is more extreme than what it had agreed with v
The opposite is not OK: v might assume u supports its decision more strongly than u actually does
Tie breaking prefers a negative decision
![Page 21: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/21.jpg)
LSD-Majority – The Protocol
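If

$$c^{uv} + c^{vu} = 0 \text{ and } \Delta^u \ge 0$$

or

$$c^{uv} + c^{vu} > 0 \text{ and either } \left(\Delta^{uv} < 0 \text{ and } \Delta^u > \Delta^{uv}\right) \text{ or } \left(\Delta^{uv} \ge 0 \text{ and } \Delta^u < \Delta^{uv}\right)$$

then send

$$\Big\langle s^u + \sum_{w \in E^u,\, w \ne v} s^{wu},\;\; c^u + \sum_{w \in E^u,\, w \ne v} c^{wu} \Big\rangle$$

to $v$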
![Page 22: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/22.jpg)
LSD-Majority – The Protocol
The same decision is applied whenever
a message is received
$s^u$ changes
an edge fails or recovers
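The protocol on these slides can be tried out with a small single-process simulation. The sketch below is a reading of the slides under stated assumptions, not the authors' implementation: the tree is static, messages are delivered one at a time from a FIFO queue, the threshold `lam` defaults to 1/2, a non-negative knowledge value is read as a majority of ones, and the names `Node`, `run`, `outgoing`, and `react` are invented for the example.

```python
from collections import deque
from fractions import Fraction


class Node:
    def __init__(self, vote, lam=Fraction(1, 2)):
        self.s, self.c, self.lam = vote, 1, lam
        self.sent = {}  # neighbor v -> last (s, c) sent to v
        self.recv = {}  # neighbor v -> last (s, c) received from v

    def add_edge(self, v):
        self.sent[v] = (0, 0)
        self.recv[v] = (0, 0)

    def delta(self):
        """Current knowledge of this node."""
        s = self.s + sum(sv for sv, _ in self.recv.values())
        c = self.c + sum(cv for _, cv in self.recv.values())
        return s - self.lam * c

    def delta_edge(self, v):
        """Current agreement between this node and neighbor v."""
        (s_uv, c_uv), (s_vu, c_vu) = self.sent[v], self.recv[v]
        return (s_uv + s_vu) - self.lam * (c_uv + c_vu)

    def outgoing(self, v):
        """Return the message to send to v, or None if no send is needed."""
        exchanged = self.sent[v][1] + self.recv[v][1]
        d_u, d_uv = self.delta(), self.delta_edge(v)
        must_send = (exchanged == 0 and d_u >= 0) or (
            exchanged > 0
            and ((d_uv < 0 and d_u > d_uv) or (d_uv >= 0 and d_u < d_uv))
        )
        if not must_send:
            return None
        # Everything this node knows except what v itself reported.
        s = self.s + sum(sw for w, (sw, _) in self.recv.items() if w != v)
        c = self.c + sum(cw for w, (_, cw) in self.recv.items() if w != v)
        self.sent[v] = (s, c)
        return (s, c)


def run(edges, votes):
    """Simulate until no messages are in flight; return each node's decision."""
    nodes = {u: Node(s) for u, s in votes.items()}
    for u, v in edges:
        nodes[u].add_edge(v)
        nodes[v].add_edge(u)
    queue = deque()

    def react(u):
        for v in nodes[u].sent:
            msg = nodes[u].outgoing(v)
            if msg is not None:
                queue.append((u, v, msg))

    for u in nodes:
        react(u)
    while queue:
        u, v, msg = queue.popleft()
        nodes[v].recv[u] = msg
        react(v)
    # Assumed output rule: non-negative knowledge means "majority of ones".
    return {u: 1 if n.delta() >= 0 else 0 for u, n in nodes.items()}


path = [("a", "b"), ("b", "c")]
print(run(path, {"a": 1, "b": 0, "c": 0}))
print(run(path, {"a": 1, "b": 1, "c": 0}))
```

On this three-node path every node ends up with the global majority in both runs. A vote change or an edge failure would be handled by updating the node's state and calling `react` again, which this static sketch leaves out.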
![Page 23: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/23.jpg)
LSD-Majority – Example
![Page 24: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/24.jpg)
![Page 25: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/25.jpg)
LSD-Majority Results
![Page 26: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/26.jpg)
Proof of Correctness
Will be given in class
![Page 27: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/27.jpg)
Back from Majority to ARM
To decide whether an itemset $X$ is frequent or not:
set $s^u = Support(X, DB_t^u)$,
set $c^u = |DB_t^u|$,
set $\lambda = MinFreq$,
run LSDM
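The reduction from the frequency test to a vote can be checked centrally on a partitioned toy database. The sketch below only computes the decision that the distributed protocol is meant to converge to; it does not run the protocol. The function names, the use of a non-strict threshold, and the partitions are assumptions.

```python
from fractions import Fraction


def vote_inputs(x, local_db):
    """Per-node inputs for itemset x: (local support, local database size)."""
    return sum(1 for t in local_db if x <= t), len(local_db)


def globally_frequent(x, partitions, min_freq):
    """Sign test over all nodes: sum of s^u minus min_freq times sum of c^u."""
    pairs = [vote_inputs(x, db) for db in partitions]
    total_s = sum(s for s, _ in pairs)
    total_c = sum(c for _, c in pairs)
    return total_s - min_freq * total_c >= 0


# Two hypothetical nodes holding 2 and 3 transactions.
partitions = [
    [frozenset({"a", "b"}), frozenset({"a"})],
    [frozenset({"b"}), frozenset({"a", "b"}), frozenset({"a", "b", "c"})],
]
x = frozenset({"a", "b"})

print(globally_frequent(x, partitions, Fraction(1, 2)))
print(globally_frequent(x, partitions, Fraction(7, 10)))
```

The itemset appears in 3 of the 5 transactions overall, so it passes a threshold of 1/2 and fails a threshold of 7/10, even though no single node sees the whole database.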
![Page 28: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/28.jpg)
Back from Majority to ARM
To decide whether a rule $X \Rightarrow Y$ is confident or not:
set $s^u = Support(X \cup Y, DB_t^u)$,
set $c^u = Support(X, DB_t^u)$,
set $\lambda = MinConf$,
run LSDM
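A short derivation shows why the confidence test fits the same voting scheme. It is a sketch under two assumptions that the slides do not state explicitly: the global database $DB$ is the disjoint union of the local databases $DB_t^u$, and $Support(X, DB) > 0$.

$$
\begin{aligned}
Conf(X \Rightarrow Y, DB) \ge MinConf
&\iff \frac{Support(X \cup Y, DB)}{Support(X, DB)} \ge MinConf \\
&\iff \sum_u Support(X \cup Y, DB_t^u) - MinConf \cdot \sum_u Support(X, DB_t^u) \ge 0
\end{aligned}
$$

The first step uses the fact that $|DB|$ cancels in the ratio of the two frequencies. The last line is a vote in which node $u$ contributes $s^u = Support(X \cup Y, DB_t^u)$ and $c^u = Support(X, DB_t^u)$, with the threshold set to $MinConf$.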
![Page 29: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/29.jpg)
Additionally
Create candidates based on the ad-hoc solution
Create rules on-the-fly rather than upon termination
Our algorithm outputs the correct rules without specifying their global frequency and confidence
![Page 30: Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003](https://reader036.vdocuments.mx/reader036/viewer/2022070412/56649f075503460f94c1ccb3/html5/thumbnails/30.jpg)
Eventual Results
By the time the database is scanned once, in parallel, the average node has discovered 95% of the rules, and has less than 10% false rules.