School of Computer Science Carnegie Mellon University National Taiwan University of Science & Technology Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms Danai Koutra U Kang Hsing-Kuo Kenneth Pao Tai-You Ke Duen Horng (Polo) Chau Christos Faloutsos ECML PKDD, 5-9 September 2011, Athens, Greece

TRANSCRIPT

Page 1: School of Computer Science Carnegie Mellon University National Taiwan University of Science & Technology Unifying Guilt-by-Association Approaches: Theorems

School of Computer Science
Carnegie Mellon University

National Taiwan University of Science & Technology

Unifying Guilt-by-Association Approaches:
Theorems and Fast Algorithms

Danai Koutra, U Kang, Hsing-Kuo Kenneth Pao, Tai-You Ke, Duen Horng (Polo) Chau, Christos Faloutsos

ECML PKDD, 5-9 September 2011, Athens, Greece

Page 2:

Problem Definition: GBA techniques

Given: graph with N nodes & M edges; few labeled nodes
Find: class (red/green) for the rest of the nodes
Assuming: network effects (homophily / heterophily)

[Figure: example graph with a few labeled nodes; unlabeled nodes marked "?"]

© Danai Koutra - PKDD'11

Page 3:

Homophily and Heterophily

[Figure: two-step illustration (Step 1, Step 2) of homophily and heterophily]

All methods handle homophily.
NOT all methods handle heterophily,
BUT the proposed method does!

Page 4:

Why do we study these methods?

Page 5:

Motivation (1): Law Enforcement

[Tong+ ’06][Lin+ ’04][Chen+ ’11]…

[Figure: network with unlabeled nodes marked "?"]

Page 6:

Motivation (2): Cyber Security

[Kephart+ ’95][Kolter+ ’06][Song+ ’08-’11][Chau+ ’11]…

[Figure: network around a known bot; which nodes are victims, and which are botnet members?]

Page 7:

Motivation (3): Fraud Detection

[Neville+ ’05][Chau+ ’07][McGlohon+ ’09]…

[Figure: network around a known fraudster; which nodes have lax controls, and which are fraudsters?]

Page 8:

Motivation (4): Ranking

[Brin+ ’98][Tong+ ’06][Ji+ ’11]…

[Figure: nodes ranked by IMPORTANCE]

Page 9:

Our Contributions

Theory
• correspondence: BP ≈ RWR ≈ SSL
• linearization for BP
• convergence criteria for linearized BP

Practice
• FABP algorithm: fast, accurate, and scalable
• experiments on DBLP, Web, and Kronecker graphs

Page 10:

Roadmap

Background
• Belief Propagation
• Random Walk with Restarts
• Semi-supervised Learning

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions

Page 11:

Background

Apologies for diversion…

Page 12:

Background 1: Belief Propagation (BP)

• Iterative message-based method: messages are exchanged in a 1st round, a 2nd round, ... until the stop criterion is fulfilled

[Figure: example graph annotated with two-class value pairs such as (0.9, 0.1), (0.2, 0.8), (0.3, 0.7), (0.9, 0.1)]

• “Propagation matrix”, indexed by the class of the “sender” and the class of the “receiver”; it encodes homophily or heterophily:

    0.9  0.1
    0.1  0.9

  The diagonal entries are usually the same: diagonal = homophily factor h

• “About-half” homophily factor: hh = h - 0.5

    0.4  -0.4
   -0.4   0.4
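
As a minimal sketch of the centering step on this slide, the snippet below builds the two-class propagation matrix from a homophily factor h and subtracts 0.5 from every entry to obtain the "about-half" version. The function name `propagation_matrices` is hypothetical and only for illustration; the two-class restriction is an assumption taken from the 2x2 matrices shown here.

```python
def propagation_matrices(h):
    """Return (propagation matrix, hh, "about-half" residual matrix) for two classes."""
    if not 0.0 <= h <= 1.0:
        raise ValueError("homophily factor h must lie in [0, 1]")
    hh = h - 0.5  # "about-half" homophily factor
    full = [[h, 1.0 - h], [1.0 - h, h]]
    residual = [[hh, -hh], [-hh, hh]]  # every entry of `full` minus 0.5
    return full, hh, residual


full, hh, residual = propagation_matrices(0.9)
print(full)      # homophily: the diagonal dominates
print(residual)  # same matrix centered around zero

_, hh_het, _ = propagation_matrices(0.1)
print(hh_het)    # heterophily: the "about-half" factor is negative
```

With h = 0.9 the residual matrix matches the 0.4 / -0.4 matrix on the slide; a value of h below 0.5 gives a negative hh, which is how heterophily shows up after centering.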

Page 13:

Background 1: Belief Propagation Equations

[Pearl ’82][Yedidia+ ’02]…[Pandit+ ’07][Gonzalez+ ’09][Chechetka+ ’10]
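
The equations themselves are not part of this transcript. As a hedged illustration, the sketch below implements the standard sum-product message and belief updates for a two-class pairwise model, which is an assumption about what the slide showed. The function name `belief_propagation`, the dictionary-based graph representation, and the fixed number of rounds (instead of a stop criterion) are all choices made for this example.

```python
import math


def belief_propagation(edges, priors, h, rounds=10):
    """Loopy BP with two classes on an undirected graph.

    edges  : list of (i, j) node pairs
    priors : dict mapping node -> [P(class 0), P(class 1)]
    h      : homophily factor (diagonal of the propagation matrix)
    """
    psi = [[h, 1.0 - h], [1.0 - h, h]]  # propagation matrix
    neighbors = {v: set() for v in priors}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    # One message per directed edge, initialised to uniform.
    messages = {(i, j): [1.0, 1.0] for i in neighbors for j in neighbors[i]}
    for _ in range(rounds):
        updated = {}
        for (i, j) in messages:
            out = []
            for xj in range(2):
                total = 0.0
                for xi in range(2):
                    # Product of messages into i from every neighbor except j.
                    incoming = math.prod(
                        messages[(n, i)][xi] for n in neighbors[i] if n != j
                    )
                    total += priors[i][xi] * psi[xi][xj] * incoming
                out.append(total)
            norm = sum(out)
            updated[(i, j)] = [value / norm for value in out]
        messages = updated
    beliefs = {}
    for i in neighbors:
        raw = [
            priors[i][x] * math.prod(messages[(n, i)][x] for n in neighbors[i])
            for x in range(2)
        ]
        norm = sum(raw)
        beliefs[i] = [value / norm for value in raw]
    return beliefs


# Path graph 0 - 1 - 2 with only node 0 (softly) labeled as class 0.
example_edges = [(0, 1), (1, 2)]
example_priors = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.5, 0.5]}
homophily = belief_propagation(example_edges, example_priors, h=0.9)
heterophily = belief_propagation(example_edges, example_priors, h=0.1)
print(homophily)
print(heterophily)
```

Under homophily the label of node 0 spreads along the path with decreasing strength; under heterophily node 1 takes the opposite class and node 2 flips back.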

Page 14:

Background 2: Semi-Supervised Learning

• graph-based SSL
• use few labeled data & exploit neighborhood information

[Figure: STEP 1 and STEP 2 of an example; node scores 0.8, -0.3, ?, ? and -0.3, -0.1, 0.6, 0.8]

[Zhou ’06][Ji, Han ’10]…

Page 15:

Background 3: Personalized Random Walk with Restarts (RWR)

[Brin+ ’98][Haveliwala ’03][Tong+ ’06][Minkov, Cohen ’07]…

Page 16:

Background

Page 17:

Qualitative Comparison of GBA Methods

GBA Method   Heterophily   Scalability   Convergence
RWR          ✗             ✓             ✓
SSL          ✗             ✓             ✓
BP           ✓             ✓             ?
FABP         ✓             ✓             ✓

Page 19:

Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions

New work

Previous work

Page 20:

Linearized BP

Theorem [Koutra+]: BP is approximated by

    [I + a D - c’A] × bh = φh

where, in the 3-node example on the slide,
    A     = adjacency matrix, [0 1 0; 1 0 1; 0 1 0]
    D     = diagonal degree matrix, diag(d1, d2, d3)
    I     = identity matrix, diag(1, 1, 1)
    bh    = final beliefs (the unknown, "?")
    φh    = prior beliefs (entries such as 0 and ±10^-2)
    a, c’ = scalar constants

[Figure: number line from 0 to 1 with 0.5 marked, locating a probability pi]

Sketch of proof
• Odds ratio
• Maclaurin expansions

DETAILS!
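
To make the two proof ingredients concrete, the sketch below checks numerically that the odds ratio of a probability close to 0.5 is well approximated by a first-order Maclaurin expansion: for p = 0.5 + v, p / (1 - p) = (1 + 2v) / (1 - 2v) ≈ 1 + 4v. Which expansions the authors actually use is not visible in this transcript, so this particular one is an assumption; the function names are hypothetical.

```python
def odds_ratio(p):
    """Odds ratio p / (1 - p) of a probability p in (0, 1)."""
    return p / (1.0 - p)


def linearized_odds_ratio(v):
    """First-order Maclaurin approximation of odds_ratio(0.5 + v) around v = 0."""
    return 1.0 + 4.0 * v


errors = {}
for v in (0.001, 0.01, 0.1):
    exact = odds_ratio(0.5 + v)
    approx = linearized_odds_ratio(v)
    errors[v] = abs(exact - approx)
    print(f"v={v}: exact={exact:.6f} approx={approx:.6f} error={errors[v]:.2e}")
```

The error grows roughly like 8v², which is why the approximation is only trusted when beliefs and the homophily factor stay close to one half.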

Page 21:

Linearized BP vs BP

Original [Yedidia+]: Belief Propagation (non-linear)

Our proposal: Linearized BP (linear); BP is approximated by

    [I + a D - c’A] × bh = φh

with A = [0 1 0; 1 0 1; 0 1 0], D = diag(d1, d2, d3), I = diag(1, 1, 1), unknown final beliefs bh ("?"), and prior beliefs φh with entries such as 0 and ±10^-2.

Page 22:

Our Contributions

Theory
• correspondence: BP ≈ RWR ≈ SSL
• linearization for BP
• convergence criteria for linearized BP

Practice
• FABP algorithm: fast, accurate, and scalable
• experiments on DBLP, Web, and Kronecker graphs

Page 23:

Linearized BP: convergence

Theorem: Linearized BP converges if

    1-norm < 1  OR  Frobenius norm < 1

where the criteria are stated in terms of the node degrees (degree of node n).

Sketch of proof

DETAILS!
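
A hedged sketch of how such a check could look in practice: writing the linear system as the fixed-point iteration b = φ + W b with W = c’A - a D, the iteration converges whenever some induced or Frobenius norm of W is below 1. Applying the norms to W, and the closed forms a = 4hh²/(1 - 4hh²) and c’ = 2hh/(1 - 4hh²), are assumptions here; the closed forms are recalled from the accompanying paper and do not appear in this transcript. All function names are hypothetical.

```python
import math


def fabp_coefficients(hh):
    """Scalars a and c' (assumed closed forms, see the lead-in)."""
    denom = 1.0 - 4.0 * hh * hh
    return 4.0 * hh * hh / denom, 2.0 * hh / denom


def iteration_matrix(adj, hh):
    """W = c'A - aD for a dense 0/1 adjacency matrix given as nested lists."""
    a, c_prime = fabp_coefficients(hh)
    n = len(adj)
    degree = [sum(row) for row in adj]
    return [
        [c_prime * adj[i][j] - (a * degree[i] if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def one_norm(matrix):
    """Maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in matrix) for j in range(len(matrix)))


def frobenius_norm(matrix):
    return math.sqrt(sum(value * value for row in matrix for value in row))


def satisfies_sufficient_condition(adj, hh):
    """True if either norm of W is below 1 (sufficient, not necessary)."""
    w = iteration_matrix(adj, hh)
    return one_norm(w) < 1.0 or frobenius_norm(w) < 1.0


path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # the 3-node example from the slides
print(satisfies_sufficient_condition(path, 0.002))
print(satisfies_sufficient_condition(path, 0.45))
```

A `False` result only means that neither norm certifies convergence; it does not prove divergence.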

Page 24:

Our Contributions

Theory
• correspondence: BP ≈ RWR ≈ SSL
• linearization for BP
• convergence criteria for linearized BP

Practice
• FABP algorithm: fast, accurate, and scalable
• experiments on DBLP, Web, and Kronecker graphs

✓✓

Page 25:

Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions

Page 26:

Correspondence of Methods

Method   Matrix                Unknown      Known
RWR      [I - c AD^-1]      ×  x         =  (1-c) y
SSL      [I + a(D - A)]     ×  x         =  y
FABP     [I + a D - c’A]    ×  bh        =  φh

Legend (3-node example): A = adjacency matrix [0 1 0; 1 0 1; 0 1 0]; D = diag(d1, d2, d3); I = diag(1, 1, 1); unknown = final labels/beliefs ("?"); known = prior labels/beliefs (e.g. 0, 1, 1).
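
The three rows of the table can be tried side by side on a toy graph. The sketch below builds each system matrix and solves it with a small Gauss-Jordan routine, so it needs only the standard library. Several things are assumptions made for the example: SSL's coefficient is named `alpha` to keep it apart from FABP's a, the FABP scalars use a = 4hh²/(1 - 4hh²) and c’ = 2hh/(1 - 4hh²) as recalled from the accompanying paper (they are not in this transcript), and all parameter values and function names are arbitrary.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def degrees(adj):
    return [sum(row) for row in adj]


def rwr_system(adj, c):
    """I - c A D^-1 (assumes no isolated nodes, so every degree is nonzero)."""
    n, deg = len(adj), degrees(adj)
    return [[float(i == j) - c * adj[i][j] / deg[j] for j in range(n)] for i in range(n)]


def ssl_system(adj, alpha):
    """I + alpha (D - A)."""
    n, deg = len(adj), degrees(adj)
    return [
        [float(i == j) * (1.0 + alpha * deg[i]) - alpha * adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def fabp_system(adj, hh):
    """I + a D - c' A, with the assumed closed forms for a and c'."""
    n, deg = len(adj), degrees(adj)
    denom = 1.0 - 4.0 * hh * hh
    a, c_prime = 4.0 * hh * hh / denom, 2.0 * hh / denom
    return [
        [float(i == j) * (1.0 + a * deg[i]) - c_prime * adj[i][j] for j in range(n)]
        for i in range(n)
    ]


adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path graph 0 - 1 - 2
labels = [1.0, 0.0, 0.0]                       # only node 0 is labeled

rwr = solve(rwr_system(adjacency, c=0.5), [(1 - 0.5) * y for y in labels])
ssl = solve(ssl_system(adjacency, alpha=0.5), labels)
fabp = solve(fabp_system(adjacency, hh=0.1), [0.01 * y for y in labels])
print(rwr, ssl, fabp, sep="\n")
```

On this toy graph all three methods rank the nodes in the same order, by distance from the labeled node, even though the score scales differ.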

Page 27:

RWR ≈ SSL

THEOREM: RWR and SSL are identical under a condition that relates the individual homophily strength of node i (SSL) to the fly-out probability (RWR).

Simplification: use a single global homophily strength for all nodes (SSL).

DETAILS!
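
The exact condition of the theorem is not preserved in this transcript. As a worked special case only, and assuming a d-regular graph (so D = dI and AD^-1 = A/d), the two systems from the correspondence table coincide for one global choice of SSL's coefficient, written α below to distinguish it from FABP's a:

```latex
% Assumption: d-regular graph, so D = dI and A D^{-1} = A / d.
\begin{align*}
\text{RWR:}\quad & \left[ I - \tfrac{c}{d} A \right] x = (1 - c)\, y \\
\text{SSL:}\quad & \left[ I + \alpha\, (dI - A) \right] x = y \\
\text{Choose}\quad & \alpha = \frac{c}{(1 - c)\, d} \\
\Rightarrow\quad & I + \alpha\, (dI - A)
  = \frac{1}{1 - c}\, I - \frac{c}{(1 - c)\, d}\, A
  = \frac{1}{1 - c} \left[ I - \tfrac{c}{d} A \right]
\end{align*}
```

Multiplying the SSL system by (1 - c) then gives exactly the RWR system, so both return the same x. On non-regular graphs the degrees differ per node, which is consistent with the slide's distinction between an individual and a global homophily strength.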

Page 28:

RWR ≈ SSL: example

similar scores and identical rankings

[Figure: scatter plot of SSL scores vs. RWR scores against the line y = x, with one series for individual homophily strength and one for global homophily strength]

Page 29:

Our Contributions

Theory
• correspondence: BP ≈ RWR ≈ SSL
• linearization for BP
• convergence criteria for linearized BP

Practice
• FABP algorithm: fast, accurate, and scalable
• experiments on DBLP, Web, and Kronecker graphs

✓✓

Page 30:

Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions

Page 31:

Proposed algorithm: FABP

① Pick the homophily factor hh
② Solve the linear system [I + a D - c’A] × bh = φh
③ (opt) If accuracy is low, run BP with the beliefs from step ② as prior beliefs.

[Figure: the 3-node example with A = [0 1 0; 1 0 1; 0 1 0], D = diag(d1, d2, d3), I = diag(1, 1, 1), unknown beliefs "?" and priors such as 0, 1, 1; number line from 0 to 1 with 0.5 marked, locating pi]
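
Steps ① and ② can be sketched with a simple fixed-point ("power method" style) iteration, b ← φ + (c’A - a D) b, followed by classifying each node by the sign of its final belief. This is an illustration under assumptions, not the authors' implementation: the closed forms a = 4hh²/(1 - 4hh²) and c’ = 2hh/(1 - 4hh²) are recalled from the accompanying paper, the dense nested-list graph and the names `fabp` and `classify` are hypothetical, and the optional step ③ is left out.

```python
def fabp(adj, priors, hh, iterations=50):
    """Approximate bh in [I + aD - c'A] bh = phi_h by fixed-point iteration.

    adj    : dense 0/1 adjacency matrix as nested lists
    priors : "about-half" prior beliefs phi_h (0 for unlabeled nodes)
    hh     : "about-half" homophily factor, small enough for convergence
    """
    denom = 1.0 - 4.0 * hh * hh
    a, c_prime = 4.0 * hh * hh / denom, 2.0 * hh / denom  # assumed closed forms
    n = len(adj)
    degree = [sum(row) for row in adj]
    beliefs = list(priors)
    for _ in range(iterations):
        beliefs = [
            priors[i]
            + c_prime * sum(adj[i][j] * beliefs[j] for j in range(n))
            - a * degree[i] * beliefs[i]
            for i in range(n)
        ]
    return beliefs


def classify(beliefs):
    """Map each final belief to '+', '-' or '?' by its sign."""
    return ["+" if b > 0 else "-" if b < 0 else "?" for b in beliefs]


# Path graph 0 - 1 - 2 - 3 with one positive and one negative labeled endpoint.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
priors = [0.001, 0.0, 0.0, -0.001]

homophily_classes = classify(fabp(path, priors, hh=0.002))
heterophily_classes = classify(fabp(path, priors, hh=-0.002))
print(homophily_classes)
print(heterophily_classes)
```

Each iteration touches one belief per node and one product per edge, which is the property the scalability results rely on; a sparse edge-list representation would be the natural choice for real graphs.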

Page 32:

Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions

Page 33:

Datasets

• p% labeled nodes initially (YahooWeb: .edu/others | DBLP: AI/not AI)
• accuracy computed on hold-out set

Dataset       # nodes          # edges
YahooWeb      1,413,511,390    6,636,600,779   (6 billion!)
Kronecker 1   177,147          1,977,149,596
Kronecker 2   120,552          1,145,744,786
Kronecker 3   59,049           282,416,924
Kronecker 4   19,683           40,333,924
DBLP          37,791           170,794

Page 34:

Specs

• hadoop version 0.20.2
• M45 hadoop cluster (Yahoo!): 500 machines, 4000 cores, 1.5PB total storage, 3.5TB of memory
• 100 machines used for the experiments

Page 35:

Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments
1. Accuracy
2. Convergence
3. Sensitivity
4. Scalability
5. Parallelism

Conclusions

Page 36:

Results (1): Accuracy

All points on the diagonal: scores near-identical

[Figure (DBLP, 0.3% labels): scatter plot of beliefs in FABP vs. beliefs in BP for (h, priors) = (0.5±0.002, 0.5±0.001); classes AI and non-AI]

Page 37:

Results (2): Convergence

FABP achieves maximum accuracy within the convergence bounds.

[Figure (DBLP, 0.3% labels): accuracy wrt hh (priors = ±0.001); x-axis hh, y-axis % accuracy; convergence bounds marked for the Frobenius norm, |e_val| = 1, and the 1-norm]

Page 38:

Results (3): Sensitivity to the homophily factor

FABP is robust to the homophily factor hh within the convergence bounds.

[Figure (DBLP, 0.3% labels): accuracy wrt hh (priors = ±0.001); x-axis hh, y-axis % accuracy; convergence bounds marked for the Frobenius norm, |e_val| = 1, and the 1-norm]

Page 39:

( For all plots ) note:
Average over 10 runs
Error bars tiny

[Figure: three panels; % accuracy vs. hh (two panels) and % accuracy vs. prior beliefs’ magnitude]

Page 40:

Results (3): Sensitivity to the prior beliefs

FABP is robust to the prior beliefs φh.

[Figure (DBLP): accuracy wrt priors (hh = ±0.002); x-axis prior beliefs’ magnitude, y-axis % accuracy; series p=5%, p=0.1%, p=0.3%, p=0.5%]

Page 41:

Results (4): Scalability

FABP's runtime is linear in the number of edges.

[Figure: runtime (min) vs. # of edges (Kronecker graphs)]

Page 42:

Results (5): Parallelism

FABP ~2x faster & wins/ties on accuracy.

[Figure: three panels with axes % accuracy, runtime (min), and # of steps]

Page 43:

Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions

Page 44:

Our Contributions

Theory
• correspondence: BP ≈ RWR ≈ SSL
• linearization for BP
• convergence criteria for linearized BP

Practice
• FABP algorithm: fast (~2x faster), accurate (same/better), and scalable (6 billion edges!)
• experiments on DBLP, Web, and Kronecker graphs

✓✓

✓ ✓

Page 45:

Thanks

• Data: Ming Ji, Jiawei Han (Illinois)
• Funding: NSC

Page 46:

Thank you!

[email protected]

[Figure: % accuracy vs. runtime (min)]

Page 47:

Q: Can we have multiple classes?

A: yes!

Propagation matrix (classes AI, ML, DB):

          AI    ML    DB
    AI    0.7   0.2   0.1
    ML    0.2   0.6   0.2
    DB    0.1   0.2   0.7

Page 48:

Q: Which of the methods do you recommend?

A: (Fast) Belief Propagation

Reasons:
• solid Bayesian foundation
• heterophily and multiple classes

Propagation matrix:

    0.7   0.2   0.1
    0.2   0.6   0.2
    0.1   0.2   0.7

Page 49:

Q: Why is FABP faster than BP?

A:
• BP: 2|E| messages per iteration
• FABP: |V| records per “power method” iteration

|V| < 2 |E|
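
As a small worked example of this comparison, the snippet below plugs in the DBLP sizes from the Datasets slide (37,791 nodes and 170,794 edges). The function name `per_iteration_work` is hypothetical, and the counts cover only the per-iteration quantities named in the answer above, not total runtime.

```python
def per_iteration_work(num_nodes, num_edges):
    """Per-iteration quantities from the slide: 2|E| BP messages vs. |V| FABP records."""
    return {"bp_messages": 2 * num_edges, "fabp_records": num_nodes}


dblp = per_iteration_work(num_nodes=37_791, num_edges=170_794)
ratio = dblp["bp_messages"] / dblp["fabp_records"]
print(dblp, f"ratio={ratio:.1f}")
```

For DBLP this gives 341,588 messages against 37,791 records, roughly a factor of nine per iteration.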