
School of Computer Science, Carnegie Mellon University

National Taiwan University of Science & Technology

Unifying Guilt-by-Association Approaches:

Theorems and Fast Algorithms

Danai Koutra, U Kang, Hsing-Kuo Kenneth Pao,

Tai-You Ke, Duen Horng (Polo) Chau, Christos Faloutsos

ECML PKDD, 5-9 September 2011, Athens, Greece

Problem Definition: GBA techniques

Given: a graph with N nodes & M edges, and a few labeled nodes
Find: the class (red/green) of the remaining nodes
Assuming: network effects (homophily/heterophily)

[Figure: graph with a few labeled nodes; the unlabeled nodes are marked "?"]

© Danai Koutra - PKDD'11


Homophily and Heterophily

[Figure: Step 1 and Step 2 of label propagation, under homophily and under heterophily]

All methods handle homophily.

NOT all methods handle heterophily, BUT the proposed method does!


Why do we study these methods?


Motivation (1): Law Enforcement

[Tong+ ’06][Lin+ ’04][Chen+ ’11]…


Motivation (2): Cyber Security

[Kephart+ ’95][Kolter+ ’06][Song+ ’08-’11][Chau+ ’11]…

[Figure: network containing a known bot; which nodes are botnet members, and which are victims?]


Motivation (3): Fraud Detection

[Neville+ ’05][Chau+ ’07][McGlohon+ ’09]…

[Figure: network containing a known fraudster; which nodes are fraudsters, and where are controls lax?]


Motivation (4): Ranking

[Brin+ ‘98][Tong+ ’06][Ji+ ‘11]…

IMPORTANCE


Our Contributions

Theory
• correspondence: BP ≈ RWR ≈ SSL
• linearization for BP
• convergence criteria for linearized BP

Practice
• FABP algorithm: fast, accurate, and scalable
• experiments on DBLP, Web, and Kronecker graphs


Roadmap

Background
• Belief Propagation
• Random Walk with Restarts
• Semi-supervised Learning

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions


Background

Apologies for diversion…


Background 1: Belief Propagation (BP)

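• Iterative message-based method: 1st round, 2nd round, ... until the stop criterion is fulfilled

[Figure: nodes exchanging messages, with example values (0.9, 0.1), (0.2, 0.8), (0.3, 0.7), (0.9, 0.1)]

• “Propagation matrix”, indexed by class of “sender” and class of “receiver”; it encodes homophily or heterophily. Homophily example:

0.9 0.1
0.1 0.9

Usually the diagonal entries are the same: diagonal = homophily factor h

“about-half” homophily factor hh = h - 0.5:

0.4 -0.4
-0.4 0.4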


Background 1: Belief Propagation Equations

[Pearl ‘82][Yedidia+ ‘02]…[Pandit+ ‘07][Gonzalez+ ‘09][Chechetka+ ‘10]
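The message and belief updates that this slide refers to can be illustrated with a minimal Python sketch of sum-product BP on a small graph. It is a sketch under assumptions, not the authors' implementation: the function name `belief_propagation`, the 3-node path graph, and the heterophily matrix are illustrative choices; the homophily matrix (0.9 / 0.1) is the one shown on the propagation-matrix slide.

```python
import numpy as np

def belief_propagation(edges, priors, psi, n_iters=20):
    """Sum-product BP on an undirected graph (illustrative sketch).

    edges  : list of (i, j) pairs, each undirected edge listed once
    priors : (n, k) float array; row i is the prior belief of node i
    psi    : (k, k) propagation matrix; psi[s, r] couples a sender in
             class s with a receiver in class r
    """
    n, k = priors.shape
    neighbors = {i: [] for i in range(n)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # One message per directed edge, initialised to uniform.
    msgs = {(i, j): np.full(k, 1.0 / k)
            for i in neighbors for j in neighbors[i]}
    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:
            # Prior of i times all messages into i, except the one from j.
            incoming = priors[i].copy()
            for m in neighbors[i]:
                if m != j:
                    incoming = incoming * msgs[(m, i)]
            out = psi.T @ incoming      # sum over the sender's classes
            new_msgs[(i, j)] = out / out.sum()
        msgs = new_msgs
    # Final belief: prior times all incoming messages, normalised per node.
    beliefs = priors.copy()
    for i in range(n):
        for m in neighbors[i]:
            beliefs[i] = beliefs[i] * msgs[(m, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# Hypothetical example: path graph 0 - 1 - 2, only node 0 is labeled.
edges = [(0, 1), (1, 2)]
priors = np.array([[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]])
homophily = np.array([[0.9, 0.1], [0.1, 0.9]])
heterophily = np.array([[0.1, 0.9], [0.9, 0.1]])  # assumed, not from the slides
print(belief_propagation(edges, priors, homophily))
print(belief_propagation(edges, priors, heterophily))
```

Under homophily every node leans toward the class of node 0; under the assumed heterophily matrix the preferred class alternates along the path.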


Background 2: Semi-Supervised Learning

• graph-based SSL
• use few labeled data & exploit neighborhood information

[Figure: STEP 1 with node scores 0.8, -0.3, ?, ?; STEP 2 with node scores -0.3, -0.1, 0.6, 0.8]

[Zhou ‘06][Ji, Han ’10]…
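A minimal sketch of graph-based SSL, assuming the formulation [I + a(D - A)] × x = y that appears in the method-comparison table of this deck. The function name `ssl_scores`, the value a = 1, and the 4-node path graph are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def ssl_scores(A, y, a=1.0):
    """Solve [I + a (D - A)] x = y for the final scores x.

    A : (n, n) symmetric adjacency matrix
    y : (n,) known labels (0 for unlabeled nodes)
    a : homophily strength (assumed value; not given on the slide)
    """
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))          # diagonal degree matrix
    return np.linalg.solve(np.eye(n) + a * (D - A), y)

# Hypothetical example: path 0 - 1 - 2 - 3 with opposite labels at the ends.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, -1.0])
print(ssl_scores(A, y))
```

Each unlabeled node ends up with a score of the same sign as its nearest labeled node, which is the neighborhood effect the slide describes.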


Background 3: Personalized Random Walk with Restarts (RWR)

[Brin+ ’98][Haveliwala ’03][Tong+ ‘06][Minkov, Cohen ‘07]…
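A minimal sketch of personalized RWR as a power iteration, assuming the formulation [I - c A D^-1] × x = (1-c) y from the method-comparison table of this deck. The function name `rwr_scores`, the value c = 0.85, and the example graph are illustrative assumptions.

```python
import numpy as np

def rwr_scores(A, y, c=0.85, n_iters=200):
    """Iterate x <- (1 - c) y + c A D^-1 x.

    A : (n, n) adjacency matrix with no isolated nodes
    y : (n,) restart distribution (sums to 1)
    c : probability of following an edge; 1 - c is the restart probability
    """
    # A D^-1: divide each column by its sum, giving a column-stochastic matrix.
    walk = A / A.sum(axis=0, keepdims=True)
    x = y.copy()
    for _ in range(n_iters):
        x = (1 - c) * y + c * (walk @ x)
    return x

# Hypothetical example: path 0 - 1 - 2, restarting at node 0.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0])
print(rwr_scores(A, y))
```

The fixed point of the iteration satisfies the linear system above, so for small graphs the result can be cross-checked with a direct solve.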


Background


Qualitative Comparison of GBA Methods

GBA Method   Heterophily   Scalability   Convergence
RWR          ✗             ✓             ✓
SSL          ✗             ✓             ✓
BP           ✓             ✓             ?
FABP         ✓             ✓             ✓


Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions

New work

Previous work


Linearized BP

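Theorem [Koutra+]: BP is approximated by a linear system.

Sketch of proof:
• Odds ratio
• Maclaurin expansions

[Equation figure: the unknown final beliefs "?" are obtained from the prior beliefs (0, -10^-2, 10^-2) through a matrix built, with scalar constants, from the identity diag(1, 1, 1), the degree matrix diag(d1, d2, d3), and the adjacency matrix [0 1 0; 1 0 1; 0 1 0]]

[Figure: number line from 0 to 1 with 0.5 and pi marked]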

DETAILS!
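For reference, the linear system can be written out as follows. The matrix form matches the FABP row of the method-comparison table in this deck ([I + a D - c’A] × bh = φh); the closed forms of the scalar constants a and c' are not preserved in the slide text and are given here as recalled from the accompanying paper, so they should be treated as an assumption. Here h_h is the “about-half” homophily factor (written hh on the slides).

```latex
\[
  \mathbf{b}_h = \left[\mathbf{I} + a\,\mathbf{D} - c'\,\mathbf{A}\right]^{-1} \boldsymbol{\phi}_h,
  \qquad
  a = \frac{4 h_h^{2}}{1 - 4 h_h^{2}},
  \qquad
  c' = \frac{2 h_h}{1 - 4 h_h^{2}}
\]
```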


Linearized BP vs BP

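BP is approximated by Linearized BP.

Our proposal (Linearized BP): linear
Original [Yedidia+] (Belief Propagation): non-linear

[Equation figure: linear system with the identity diag(1, 1, 1), the degree matrix diag(d1, d2, d3), the adjacency matrix [0 1 0; 1 0 1; 0 1 0], the unknown final beliefs "?", and the prior beliefs (0, -10^-2, 10^-2)]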


Our Contributions


linearized BP ✓


Linearized BP: convergence

Theorem: Linearized BP converges if the 1-norm < 1 OR the Frobenius norm < 1 (the bounds involve the degree of node n).

DETAILS!

Sketch of proof
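The norm test can be sketched numerically. This is a hedged illustration, not the closed-form bounds of the talk: it assumes the iteration matrix is W = c'A - aD (so that the system [I + a D - c’A] × bh = φh reads bh = φh + W bh), and it assumes the constants a = 4hh²/(1 - 4hh²) and c' = 2hh/(1 - 4hh²) as recalled from the accompanying paper. The function names are hypothetical. If any sub-multiplicative norm of W is below 1, the series φh + Wφh + W²φh + ... converges.

```python
import numpy as np

def fabp_coefficients(hh):
    """Scalar constants a and c' (assumed closed forms; requires |hh| < 0.5)."""
    denom = 1.0 - 4.0 * hh ** 2
    return 4.0 * hh ** 2 / denom, 2.0 * hh / denom

def converges(A, hh):
    """Sufficient test: 1-norm < 1 OR Frobenius norm < 1 for W = c'A - aD."""
    a, c_prime = fabp_coefficients(hh)
    D = np.diag(A.sum(axis=1))
    W = c_prime * A - a * D
    one_norm = np.linalg.norm(W, 1)        # maximum absolute column sum
    fro_norm = np.linalg.norm(W, 'fro')    # Frobenius norm
    return bool(one_norm < 1 or fro_norm < 1)

# Hypothetical example: the 3-node path graph used in the slide figures.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
for hh in (0.002, 0.1, 0.49):
    print(hh, converges(A, hh))
```

The test is only sufficient: a False result does not by itself prove divergence, since the exact criterion is on the largest eigenvalue magnitude.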


Our Contributions


linearized BP ✓✓


Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions



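Method   Matrix              Unknown   Known
RWR      [I - c A D^-1]      × x       = (1-c) y
SSL      [I + a(D - A)]      × x       = y
FABP     [I + a D - c’A]     × bh      = φh

[Figure: example with the adjacency matrix [0 1 0; 1 0 1; 0 1 0], the identity diag(1, 1, 1), the degree matrix diag(d1, d2, d3), the unknown final labels/beliefs "?", and the prior labels/beliefs (0, 1, 1)]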


RWR ≈ SSL

THEOREM: RWR and SSL are identical if [equation: condition relating the fly-out probability (RWR) to the individual homophily strength of node i (SSL)].

Simplification: [equation: global homophily strength of nodes (SSL)]

DETAILS!


RWR ≈ SSL: example

similar scores and identical rankings

[Figure: scatter plots of SSL scores vs. RWR scores with the line y = x, for individual hom. strength and for global hom. strength]


Our Contributions


linearized BP ✓✓✓


Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions


Proposed algorithm: FABP

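① Pick the homophily factor hh

② Solve the linear system [I + a D - c’A] × bh = φh

③ (optional) If accuracy is low, run BP with the beliefs from step ② as prior beliefs.

[Equation figure: linear system with the adjacency matrix [0 1 0; 1 0 1; 0 1 0], the identity diag(1, 1, 1), the degree matrix diag(d1, d2, d3), the unknown final beliefs "?", and the priors (0, 1, 1)]

[Figure: number line from 0 to 1 with 0.5 and pi marked]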
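Steps ① and ② can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' Hadoop implementation: it solves [I + a D - c’A] × bh = φh with a simple fixed-point ("power method"-style) iteration, and it assumes the constants a = 4hh²/(1 - 4hh²) and c' = 2hh/(1 - 4hh²) as recalled from the accompanying paper. The function name `fabp` and the example graph are hypothetical; the magnitudes hh = ±0.002 and prior 0.001 are the ones quoted on the results slides.

```python
import numpy as np

def fabp(A, phi_h, hh, n_iters=50):
    """Iterate b <- phi_h + c' A b - a D b (converges when the norm test holds).

    A     : (n, n) symmetric adjacency matrix
    phi_h : (n,) about-half prior beliefs (prior - 0.5; 0 for unlabeled nodes)
    hh    : about-half homophily factor (> 0 homophily, < 0 heterophily)
    """
    denom = 1.0 - 4.0 * hh ** 2
    a, c_prime = 4.0 * hh ** 2 / denom, 2.0 * hh / denom   # assumed closed forms
    d = A.sum(axis=1)                                      # node degrees
    b = phi_h.copy()
    for _ in range(n_iters):
        b = phi_h + c_prime * (A @ b) - a * d * b
    return b

# Hypothetical example: path 0 - 1 - 2, node 0 labeled with the positive class.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
phi_h = np.array([0.001, 0.0, 0.0])
print(fabp(A, phi_h, hh=0.002))    # homophily
print(fabp(A, phi_h, hh=-0.002))   # heterophily
```

The sign of each final belief gives the predicted class. Each iteration touches only one value per node, which is consistent with the deck's remark that FABP handles |V| records per iteration.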


Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions


Datasets

• p% labeled nodes initially
  YahooWeb: .edu/others | DBLP: AI/not AI
• accuracy computed on hold-out set

Dataset       # nodes         # edges
YahooWeb      1,413,511,390   6,636,600,779 (6 billion!)
Kronecker 1   177,147         1,977,149,596
Kronecker 2   120,552         1,145,744,786
Kronecker 3   59,049          282,416,924
Kronecker 4   19,683          40,333,924
DBLP          37,791          170,794


Specs

• Hadoop version 0.20.2
• M45 Hadoop cluster (Yahoo!): 500 machines, 4000 cores, 1.5PB total storage, 3.5TB of memory
• 100 machines used for the experiments


Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments
1. Accuracy
2. Convergence
3. Sensitivity
4. Scalability
5. Parallelism

Conclusions


Results (1): Accuracy

All points lie on the diagonal: the scores are near-identical.

[Figure (DBLP, 0.3% labels): scatter plot of beliefs in FABP vs. beliefs in BP for (h, priors) = (0.5±0.002, 0.5±0.001); classes: AI, non-AI]


Results (2): Convergence

FABP achieves maximum accuracy within the convergence bounds.

[Figure (DBLP, 0.3% labels): % accuracy wrt hh (priors = ±0.001), with the convergence bounds marked: Frobenius norm, |e_val| = 1, 1-norm]


Results (3): Sensitivity to the homophily factor

FABP is robust to the homophily factor hh within the convergence bounds.

[Figure (DBLP, 0.3% labels): % accuracy wrt hh (priors = ±0.001), with the convergence bounds marked: Frobenius norm, |e_val| = 1, 1-norm]


(For all plots)

Average over 10 runs; error bars tiny.

[Figures: % accuracy vs. hh (two panels) and % accuracy vs. prior beliefs’ magnitude]


Results (3): Sensitivity to the prior beliefs

FABP is robust to the prior beliefs φh.

[Figure (DBLP): % accuracy wrt prior beliefs’ magnitude (hh = ±0.002), for p = 5%, 0.1%, 0.3%, 0.5%]


Results (4): Scalability

FABP runtime is linear in the number of edges.

[Figure: runtime (min) vs. # of edges (Kronecker graphs)]


Results (5): Parallelism

FABP is ~2x faster than BP and wins or ties on accuracy.

[Figures: panels with axes # of steps, runtime (min), and % accuracy]


Roadmap

Background

Linearized BP

Correspondence of Methods

Proposed Algorithm

Experiments

Conclusions


Our Contributions


linearized BP

FABP: fast (~2x faster), accurate (same/better), and scalable (6 billion edges!)

✓✓✓✓✓


Thanks

• Data: Ming Ji, Jiawei Han (Illinois)

• Funding: NSC


Thank you!

[email protected]

[Figure: % accuracy vs. runtime (min)]


Q: Can we have multiple classes?

A: yes!

Propagation matrix (classes AI, ML, DB):

0.7 0.2 0.1
0.2 0.6 0.2
0.1 0.2 0.7


Q: Which of the methods do you recommend?

A: (Fast) Belief Propagation

Reasons:
• solid Bayesian foundation
• heterophily and multiple classes

Propagation matrix:

0.7 0.2 0.1
0.2 0.6 0.2
0.1 0.2 0.7


Q: Why is FABP faster than BP?

A:
• BP: 2|E| messages per iteration
• FABP: |V| records per “power method” iteration

|V| < 2|E|
