School of Computer Science, Carnegie Mellon University
National Taiwan University of Science & Technology
Unifying Guilt-by-Association Approaches:
Theorems and Fast Algorithms
Danai Koutra, U Kang
Hsing-Kuo Kenneth Pao
Tai-You Ke, Duen Horng (Polo) Chau
Christos Faloutsos
ECML PKDD, 5-9 September 2011, Athens, Greece
Problem Definition: GBA techniques
Given: graph with N nodes & M edges; few labeled nodes
Find: class (red/green) for the remaining nodes
Assuming: network effects (homophily/heterophily)
© Danai Koutra - PKDD'11
Homophily and Heterophily
[Figure: Step 1 and Step 2, shown under homophily and under heterophily]
All methods handle homophily.
NOT all methods handle heterophily, BUT the proposed method does!
Why do we study these methods?
Motivation (1): Law Enforcement
[Tong+ ’06][Lin+ ‘04][Chen+ ’11]…
Motivation (2): Cyber Security
victims?
[Kephart+ ’95][Kolter+ ’06][Song+ ’08-’11][Chau+ ‘11]…
botnet members?
bot
Motivation (3): Fraud Detection
Lax controls?
[Neville+ ‘05][Chau+ ’07][McGlohon+ ’09]…
fraudsters?
fraudster
Motivation (4): Ranking
[Brin+ ‘98][Tong+ ’06][Ji+ ‘11]…
IMPORTANCE
Our Contributions
Theory
• correspondence: BP ≈ RWR ≈ SSL
• linearization for BP
• convergence criteria for linearized BP
Practice
• FABP algorithm: fast, accurate and scalable
• Experiments on DBLP, Web, and Kronecker graphs
Roadmap
Background
• Belief Propagation
• Random Walk with Restarts
• Semi-supervised Learning
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Background
Apologies for diversion…
Background 1: Belief Propagation (BP)
• Iterative message-based method
  [Figure: messages exchanged in the 1st round, 2nd round, ... until stop criterion fulfilled; values shown: (0.9, 0.1), (0.2, 0.8), (0.3, 0.7), (0.9, 0.1)]
• “Propagation matrix”: Homophily / Heterophily
  indexed by class of “sender” and class of “receiver”:
    0.9 0.1
    0.1 0.9
  diagonal (usually the same) = homophily factor h
• “about-half” homophily factor hh = h - 0.5:
    0.4 -0.4
    -0.4 0.4
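The centering step on this slide (hh = h - 0.5) can be sketched in a few lines. This is a minimal illustration, assuming a symmetric 2-class propagation matrix with h on the diagonal; the function name `center_propagation_matrix` is hypothetical and not part of the slides.

```python
import numpy as np

def center_propagation_matrix(h: float) -> np.ndarray:
    """Build the 2x2 propagation matrix for homophily factor h and
    subtract 0.5 from every entry ("about-half" form)."""
    propagation = np.array([[h, 1.0 - h],
                            [1.0 - h, h]])
    return propagation - 0.5

# h = 0.9 is the example value shown on the slide.
centered = center_propagation_matrix(0.9)
print(centered)
```

With h = 0.9 the diagonal becomes hh = 0.4 and the off-diagonal becomes -0.4, matching the matrix on the slide up to floating-point rounding.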
Background 1: Belief Propagation Equations
[Pearl ‘82][Yedidia+ ‘02]…[Pandit+ ‘07][Gonzalez+ ‘09][Chechetka+ ‘10]
Background 2: Semi-Supervised Learning
• graph-based SSL
• use few labeled data & exploit neighborhood information
[Figure: STEP 1 and STEP 2; node scores shown include 0.8, -0.3, -0.1, 0.6, and unknown (?)]
[Zhou ‘06][Ji, Han ’10]…
Background 3: Personalized Random Walk with Restarts (RWR)
[Brin+ ’98][Haveliwala ’03][Tong+ ‘06][Minkov, Cohen ‘07]…
Background
Qualitative Comparison of GBA Methods
GBA Method   Heterophily   Scalability   Convergence
RWR          ✗             ✓             ✓
SSL          ✗             ✓             ✓
BP           ✓             ✓             ?
FABP         ✓             ✓             ✓
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
New work
Previous work
Linearized BP
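Theorem [Koutra+]: BP is approximated by
[I + a D - c’A] × bh = φh
  bh: final beliefs (unknown)
  φh: prior beliefs
  a, c’: scalar constants
[Figure: example on a 3-node graph with adjacency matrix rows (0 1 0), (1 0 1), (0 1 0), identity diagonal (1, 1, 1), degree diagonal (d1, d2, d3), and prior beliefs of magnitude 0 or 10^-2]
[Figure: pi on an axis from 0 to 1, centered at 0.5]
Sketch of proof
• Odds ratio
• Maclaurin expansions
DETAILS!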
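The proof sketch names two ingredients, the odds ratio and Maclaurin expansions. As a hedged illustration of how they fit together (the specific expression below is worked out here and is not shown on the slide): for a probability p = 0.5 + ph, the odds ratio p / (1 - p) equals (1 + 2ph) / (1 - 2ph), whose first-order Maclaurin expansion in ph is 1 + 4ph. The function names are hypothetical.

```python
def odds_ratio(p: float) -> float:
    """Exact odds ratio p / (1 - p)."""
    return p / (1.0 - p)

def odds_ratio_first_order(p_h: float) -> float:
    """First-order Maclaurin approximation of the odds ratio
    around p = 0.5, written in the "about-half" variable p_h = p - 0.5."""
    return 1.0 + 4.0 * p_h

# The approximation error grows roughly like 8 * p_h**2.
for p_h in (0.001, 0.01, 0.1):
    exact = odds_ratio(0.5 + p_h)
    approx = odds_ratio_first_order(p_h)
    print(f"p_h={p_h}: exact={exact:.6f} approx={approx:.6f} error={exact - approx:.2e}")
```

The approximation is tight only when beliefs stay close to 0.5, which is consistent with the small prior magnitudes used in the experiments.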
Linearized BP vs BP
Original [Yedidia+]: Belief Propagation (non-linear)
Our proposal: Linearized BP (linear)
BP is approximated by Linearized BP:
[I + a D - c’A] × bh = φh
Our Contributions
Theory
• correspondence: BP ≈ RWR ≈ SSL
• linearization for BP ✓
• convergence criteria for linearized BP
Practice
• FABP algorithm: fast, accurate and scalable
• Experiments on DBLP, Web, and Kronecker graphs
Linearized BP: convergence
Theorem: Linearized BP converges if
1-norm < 1 OR Frobenius norm < 1
(annotated symbol: degree of node n)
Sketch of proof
DETAILS!
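The convergence conditions can be checked numerically on a small graph. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes the constants a = 4·hh² / (1 - 4·hh²) and c’ = 2·hh / (1 - 4·hh²) from the accompanying paper (they are not given in the slides), rewrites [I + a D - c’A] × bh = φh as bh = φh + W × bh with W = c’A - a D, and tests whether the 1-norm or the Frobenius norm of W is below 1. All function names are hypothetical.

```python
import numpy as np

def fabp_constants(hh: float) -> tuple[float, float]:
    """Assumed constants a and c' as functions of the about-half factor hh."""
    denom = 1.0 - 4.0 * hh ** 2
    return 4.0 * hh ** 2 / denom, 2.0 * hh / denom

def iteration_matrix(A: np.ndarray, hh: float) -> np.ndarray:
    """W = c'A - aD, so that [I + aD - c'A] b = phi becomes b = phi + W b."""
    a, c_prime = fabp_constants(hh)
    D = np.diag(A.sum(axis=1))
    return c_prime * A - a * D

def norms(A: np.ndarray, hh: float) -> tuple[float, float]:
    """Return (1-norm, Frobenius norm) of the iteration matrix."""
    W = iteration_matrix(A, hh)
    return np.linalg.norm(W, 1), np.linalg.norm(W, "fro")

def power_method(A: np.ndarray, hh: float, phi: np.ndarray, n_iter: int = 200) -> np.ndarray:
    """Sum the series phi + W phi + W^2 phi + ... for n_iter terms."""
    W = iteration_matrix(A, hh)
    beliefs = phi.copy()
    term = phi.copy()
    for _ in range(n_iter):
        term = W @ term
        beliefs = beliefs + term
    return beliefs

# 3-node path graph, matching the adjacency matrix drawn on the slides.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
phi = np.array([0.0, 0.01, -0.01])  # illustrative priors, not from the slides

one_norm, fro_norm = norms(A, 0.1)
converges = one_norm < 1.0 or fro_norm < 1.0
beliefs = power_method(A, 0.1, phi)
print(one_norm, fro_norm, converges)
```

When either norm is below 1 the series converges and agrees with a direct solve of the linear system; for hh close to 0.5 both norms exceed 1 on this graph and the check fails.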
Our Contributions
Theory
• correspondence: BP ≈ RWR ≈ SSL
• linearization for BP ✓
• convergence criteria for linearized BP ✓
Practice
• FABP algorithm: fast, accurate and scalable
• Experiments on DBLP, Web, and Kronecker graphs
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Correspondence of Methods
Method   Matrix              Unknown   Known
RWR      [I - c AD^-1]       × x       = (1-c) y
SSL      [I + a(D - A)]      × x       = y
FABP     [I + a D - c’A]     × bh      = φh
x, bh: final labels/beliefs (unknown)
y, φh: prior labels/beliefs (known)
A: adjacency matrix; D: diagonal matrix of degrees (d1, d2, d3)
[Figure: example with adjacency matrix rows (0 1 0), (1 0 1), (0 1 0)]
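All three rows of the table are linear systems of the same shape, so one solver call handles each. The following is a minimal sketch on the 3-node path graph drawn on the slide; the parameter values (c, a, c’) and the prior vectors are hypothetical choices for illustration only, and a dense solver is used purely because the example is tiny.

```python
import numpy as np

# 3-node path graph from the slide.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))   # degree matrix diag(d1, d2, d3)
I = np.eye(3)

y = np.array([0.0, 1.0, 1.0])        # illustrative prior labels
phi_h = np.array([0.0, 0.01, 0.01])  # illustrative "about-half" priors

c = 0.5                        # RWR coefficient (hypothetical value)
a_ssl = 0.1                    # SSL coefficient (hypothetical value)
a_fabp, c_prime = 0.04, 0.2    # FABP constants (hypothetical values)

# Each method: build its matrix, then solve matrix @ unknown = known.
M_rwr = I - c * A @ np.linalg.inv(D)
M_ssl = I + a_ssl * (D - A)
M_fabp = I + a_fabp * D - c_prime * A

x_rwr = np.linalg.solve(M_rwr, (1.0 - c) * y)
x_ssl = np.linalg.solve(M_ssl, y)
b_fabp = np.linalg.solve(M_fabp, phi_h)

print("RWR :", x_rwr)
print("SSL :", x_ssl)
print("FABP:", b_fabp)
```

Because A D^-1 is column-stochastic, the RWR scores sum to the same total as the prior labels, which gives a quick sanity check on the solve.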
RWR ≈ SSL
THEOREM: RWR and SSL are identical if the individual homophily strength of node i (SSL) is matched to the fly-out probability (RWR).
Simplification: use a global homophily strength of nodes (SSL).
DETAILS!
RWR ≈ SSL: example
similar scores and identical rankings
[Figure: scatter plot of SSL scores vs. RWR scores against the line y = x, for individual and global homophily strength]
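The exact condition of the theorem is not preserved in this transcript, but a special case can be verified directly from the two systems in the correspondence table. On a d-regular graph (D = d·I), multiplying the RWR system by (1 + a·d) shows that it coincides with the SSL system when a = c / ((1 - c)·d). The sketch below checks this on a 5-node cycle; the choice of graph, c, and prior vector is illustrative, and the helper name is hypothetical.

```python
import numpy as np

def cycle_adjacency(n: int) -> np.ndarray:
    """Adjacency matrix of an n-node cycle (every node has degree 2)."""
    A = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n
        A[i, j] = A[j, i] = 1.0
    return A

n, d = 5, 2
A = cycle_adjacency(n)
D = d * np.eye(n)
I = np.eye(n)

y = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # one labeled node (illustrative)
c = 0.8                                   # RWR coefficient (illustrative)
a = c / ((1.0 - c) * d)                   # matching SSL coefficient for a d-regular graph

x_rwr = np.linalg.solve(I - c * A @ np.linalg.inv(D), (1.0 - c) * y)
x_ssl = np.linalg.solve(I + a * (D - A), y)

print(x_rwr)
print(x_ssl)
```

On graphs with varying degrees a single global coefficient can only approximate this match, which is in line with the slide's "similar scores" claim for the global homophily strength.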
Our Contributions
Theory
• correspondence: BP ≈ RWR ≈ SSL ✓
• linearization for BP ✓
• convergence criteria for linearized BP ✓
Practice
• FABP algorithm: fast, accurate and scalable
• Experiments on DBLP, Web, and Kronecker graphs
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Proposed algorithm: FABP
① Pick the homophily factor hh
② Solve the linear system [I + a D - c’A] × bh = φh
③ (opt) If accuracy is low, run BP with prior beliefs bh (the solution of step ②).
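Steps ① and ② can be sketched as follows. This is a small dense illustration, not the authors' Hadoop implementation: it assumes the constants a = 4·hh² / (1 - 4·hh²) and c’ = 2·hh / (1 - 4·hh²) from the accompanying paper (not stated in the slides), and the function name `fabp`, the 4-node path graph, and the prior values are hypothetical.

```python
import numpy as np

def fabp(A: np.ndarray, phi: np.ndarray, hh: float) -> np.ndarray:
    """Solve [I + aD - c'A] b = phi for the final "about-half" beliefs b."""
    denom = 1.0 - 4.0 * hh ** 2
    a = 4.0 * hh ** 2 / denom       # assumed constant, see lead-in
    c_prime = 2.0 * hh / denom      # assumed constant, see lead-in
    D = np.diag(A.sum(axis=1))
    M = np.eye(A.shape[0]) + a * D - c_prime * A
    return np.linalg.solve(M, phi)

# Path graph 0 - 1 - 2 - 3 with the two end nodes labeled.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
phi = np.array([0.001, 0.0, 0.0, -0.001])  # +: one class, -: the other, 0: unlabeled

b_homophily = fabp(A, phi, hh=0.1)     # step 1: hh > 0 models homophily
b_heterophily = fabp(A, phi, hh=-0.1)  # step 1: hh < 0 models heterophily

# The sign of each final belief gives the predicted class.
print(np.sign(b_homophily))
print(np.sign(b_heterophily))
```

Under homophily each unlabeled node takes the class of its nearest labeled neighbor; under heterophily the classes alternate along the path. For graphs of the size used in the experiments a dense solve is not feasible, and an iterative scheme over the sparse adjacency matrix would be needed instead.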
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Datasets
• p% labeled nodes initially
  YahooWeb: .edu/others | DBLP: AI/not AI
• accuracy computed on hold-out set
Dataset       # nodes          # edges
YahooWeb      1,413,511,390    6,636,600,779 (6 billion!)
Kronecker 1   177,147          1,977,149,596
Kronecker 2   120,552          1,145,744,786
Kronecker 3   59,049           282,416,924
Kronecker 4   19,683           40,333,924
DBLP          37,791           170,794
Specs
• Hadoop version 0.20.2
• M45 Hadoop cluster (Yahoo!): 500 machines, 4000 cores, 1.5PB total storage, 3.5TB of memory
• 100 machines used for the experiments
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
  1. Accuracy
  2. Convergence
  3. Sensitivity
  4. Scalability
  5. Parallelism
Conclusions
Results (1): Accuracy
All points on the diagonal: scores near-identical
[Figure (DBLP, 0.3% labels): scatter plot of beliefs in FABP vs. beliefs in BP for (h, priors) = (0.5±0.002, 0.5±0.001); classes AI and non-AI]
Results (2): Convergence
FABP achieves maximum accuracy within the convergence bounds.
[Figure (DBLP, 0.3% labels): % accuracy wrt hh (priors = ±0.001); convergence bounds marked for the Frobenius norm, |e_val| = 1, and the 1-norm]
Results (3): Sensitivity to the homophily factor
FABP is robust to the homophily factor hh within the convergence bounds.
[Figure (DBLP, 0.3% labels): % accuracy wrt hh (priors = ±0.001); convergence bounds marked for the Frobenius norm, |e_val| = 1, and the 1-norm]
Note (for all plots): average over 10 runs; error bars tiny.
[Figures: % accuracy vs. hh; % accuracy vs. prior beliefs’ magnitude]
Results (3): Sensitivity to the prior beliefs
FABP is robust to the prior beliefs φh.
[Figure (DBLP): % accuracy wrt prior beliefs’ magnitude (hh = ±0.002), for p = 5%, 0.1%, 0.3%, 0.5% labeled nodes]
Results (4): Scalability
FABP scales linearly with the number of edges.
[Figure: runtime (min) vs. # of edges (Kronecker graphs)]
Results (5): Parallelism
FABP ~2x faster & wins/ties on accuracy.
[Figures: panels with axes % accuracy, runtime (min), and # of steps]
Roadmap
Background
Linearized BP
Correspondence of Methods
Proposed Algorithm
Experiments
Conclusions
Our Contributions
Theory
• correspondence: BP ≈ RWR ≈ SSL ✓
• linearization for BP ✓
• convergence criteria for linearized BP ✓
Practice
• FABP algorithm: fast (~2x faster), accurate (same/better) and scalable (6 billion edges!) ✓
• Experiments on DBLP, Web, and Kronecker graphs ✓
Thanks
• Data
• Funding
NSC
ILLINOIS: Ming Ji, Jiawei Han
Q: Can we have multiple classes?
A: yes!
Propagation matrix (classes AI, ML, DB):
      AI    ML    DB
AI    0.7   0.2   0.1
ML    0.2   0.6   0.2
DB    0.1   0.2   0.7
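Two properties of the 3-class propagation matrix on this slide can be checked mechanically: each row sums to 1, and the largest entry of each row sits on the diagonal, which is what homophily looks like with more than two classes. The sketch below is only a sanity check on the slide's numbers; the variable names are hypothetical.

```python
import numpy as np

classes = ["AI", "ML", "DB"]
H = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])

# Every row should be a probability distribution over the other node's class.
rows_sum_to_one = bool(np.allclose(H.sum(axis=1), 1.0))

# Homophily: for each class, the same class is the most likely one.
diagonal_dominates = all(H[i, i] == H[i].max() for i in range(len(classes)))

print(rows_sum_to_one, diagonal_dominates)
```

A heterophily matrix would fail the second check, because its largest entries lie off the diagonal.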
Q: Which of the methods do you recommend?
A: (Fast) Belief Propagation
Reasons:
• solid Bayesian foundation
• heterophily and multiple classes
Propagation matrix:
0.7 0.2 0.1
0.2 0.6 0.2
0.1 0.2 0.7
Q: Why is FABP faster than BP?
A:
• BP: 2|E| messages per iteration
• FABP: |V| records per “power method” iteration
|V| < 2|E|
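To make the comparison concrete, the per-iteration counts can be computed for two datasets from the Datasets slide. This is simple arithmetic on the reported node and edge counts; it counts work items only and says nothing about measured runtime. The function name is hypothetical.

```python
def per_iteration_cost(num_nodes: int, num_edges: int) -> dict:
    """Work items per iteration: BP handles 2|E| messages, FABP handles |V| records."""
    return {"bp_messages": 2 * num_edges, "fabp_records": num_nodes}

# (|V|, |E|) as reported on the Datasets slide.
datasets = {
    "DBLP": (37_791, 170_794),
    "YahooWeb": (1_413_511_390, 6_636_600_779),
}

for name, (num_nodes, num_edges) in datasets.items():
    cost = per_iteration_cost(num_nodes, num_edges)
    ratio = cost["bp_messages"] / cost["fabp_records"]
    print(f"{name}: BP {cost['bp_messages']:,} vs FABP {cost['fabp_records']:,} (ratio {ratio:.2f})")
```

On both graphs BP handles roughly nine times as many items per iteration as FABP.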