jure leskovecand anandrajaraman · 2/4/2010 jure leskovec & anand rajaraman, stanford cs345a:...

Post on 08-Jul-2020

5 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

CS345a: Data MiningJure Leskovec and Anand RajaramanjStanford University

… …

Philip S. Yu

IJCAI Q: what is most relatedconference to ICDM 

ICDM

KDD

Ning Zhong

A: Random walks!

conference to ICDM 

SDM

AAAI M. Jordan

R. RamakrishnanA: Random walks!

[Sun, ICDM2005]

NIPS…

2

Conference Author2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

0 008

0.009

0.011

0.0080.007

0.0050.011

0.005

0.0050 004

0.004

0.0040.004

32/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

[Pan, KDD ‘04]

Sea Sun Sky Wave{ } { }Cat Forest Grass Tiger

?A: Proximity on 

grpahs!

4

{?, ?, ?,}grpahs!

[Pan KDD2004]2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

[Pan, KDD ‘04]

Region

Image

Test Image

5

Sea Sun Sky Wave Cat Forest Tiger Grass

Keyword2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

[Pan, KDD ‘04]

Region

Image

Test Image{Grass, Forest, Cat, Tiger}

6

Sea Sun Sky Wave Cat Forest Tiger Grass

Keyword2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

[Tong‐Faloutsos, ‘06]

Connection subgraphs: Connection subgraphs: What is the most likely connection between Andrew McCallum and Yiming Yang:Andrew McCallum and Yiming Yang:

Michael I.Jordan

RongJin

AndrewNg 1 16

2

7

Andrew Yiming

Jordan JinNg

ZoubinJohn D.

2

22 2

3

McCallum Yang

Xiaojin

GhahramaniLaffterty

2

Fernando42

1

42

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 7

XiaojinZhu

Jian ZhangFernando

C.N. Pereira

[Tong‐Faloutsos, ‘06]

Gi Q d B Given Q query nodes Find Center‐piece subgraph on b nodes

A C

subgraph on b nodes

Input of CEPS: Q Query nodes Q Query nodes Budget b k‐softAND coefficient B

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 8

A C

[Tong‐Faloutsos, ‘06]

Individual Score Calculation: Individual Score Calculation: Measure proximity of each nodewith respect to individual query node

( , ) n Qr i j

with respect to individual query node

Combine Individual Scores:

Measure importance of a node tothe whole query set

1( , ) nr Q j

“Extract” the connection subgraph: arg max ( )H g H

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 9

[Tong‐Faloutsos, ‘06]

Goal: Goal: Calculate importance score r(i,j) = ri,j f h d j d h d i for each node j and each query node i

How to do that? How to do that?

102/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

[Tong‐Faloutsos, ‘06]

I J1

1 1

A BH1 1

1 1D

1 1

1 11E

FG

1 11

11

a.k.a.: Relevance, Closeness, ‘Similarity’…

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

Shortest path is not good:Shortest path is not good:

No influence for degree‐1 nodes (E, F)! known as ‘pizza delivery guy’ problem in undirected graphgraph

Multi‐faceted relationships2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 12

Max flow is not good: Max‐flow is not good:

Does not punish long paths

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 13

[Tong‐Faloutsos, ‘06]

I J11 1

A BH1 1

D1 1

1 11•Multiple Connections…

EF

G1 11

• Quality of connection

•Direct & In‐direct connections

14

Direct & In direct connections

•Length, Degree, Weight…2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

910

2

9

12

1

3

8

11

4

56

15

7

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

Node 4

Node 1Node 2Node 3Node 4

0.130.100.130.222

9 10

8

120.13

0.10

0.08

0.04

0.02

0.03

Node 5Node 6Node 7Node 8

0.130.050.050.08

1

4

3

6

811

30.13

0 05

0.04

Node 9Node 10Node 11Node 12

0.040.030.040.02

5 6

7

0.13

0.05

0.05

Ranking vector More red  more relevant

Nearby nodes, higher scores

16

More red, more relevant4r

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

( , ) i jQ i j rj

W : adjacency matrix. 1( )Q I cW

,( , ) i jQ j

i

j

c: damping factor

2c 3c ...W 2W 3WQ I c W WQ

all paths from i all paths from i all paths from i

17

to jwith length 1 to jwith length 2 to jwith length 3

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

(1 )r cWr c e

S i  

(1 )i i ir cWr c e R   b

0.13 0 1/3 1/3 1/3 0 0 0 0 0 0 0 00.10 1/3 0 1/3 0 0 0 0 1/4 0 0 0 0

0.13 00.10 0

Ranking vector Starting vectorTransition matrix Restart prob.

13

2

9 10

811

12

0.10 1/3 0 1/3 0 0 0 0 1/4 0 0 0 0.130.220.130.05

0.9

01/3 1/3 0 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 1/4 0 0 0 0 0 0 00 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 0

0.10 00.13 00.220.13 00.05 0

0.1

1

4

5 6

0.050.080.040.030.04

0 0 0 0 1/4 1/2 0 0 0 0 0 0 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/2 0 0 0 0 0 0 0 1/4 0 1/3 0 1/2

0.05 00.08 00.04 00.03 00.04 0

7

0.02 0 0 0 0 0 0 0 0 0 1/3 1/3 0

2 00.0

n x n n x 1n x 1182/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

0 1/3 1/3 1/3 0 0 0 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 0 0 0 1/4 0 0 0 01/3 1/3 0 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 1/4 0 0 0 0 0 0 0

0001

Query

0.9

0 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 0

0 0 0 0 1/4 1/2 0 0 0 0 0 0

00

0.10

?? 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/20 0 0 0 0

000

0 0 1/4 0 1/3 0 1/2 0

0 0 0 0 0 0 0 1/4 0 1/3 0 1/2 0

0 0 0 0 0 0 0 0 0 1/3 1/3 0 0

Adjacency matrix Starting vectorRanking vectorRanking vector Adjacency matrix Starting vectorRanking vectorRanking vector

19

2/4/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

0 1/3 1/3 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 0 0 0 1/4 0 0 0 0

00

00

0.130.10

0.30

0.120.18

0.190.09

0.140.13

0.160.10

0.130.10 1/3 1/3 0 1/3 0 0 0 0 0 0 0 0

1/3 0 1/3 0 1/4

0.9

0 0 0 0 0 0 00 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 00 0 0 0 1/4 1/2 0 0 0 0 0 0

0

00

0.10

1

01000

0.130.220.130.050 05

1

2

9 10

8

12

0.30.10.300

0.120.350.030.070 07

0.190.180.180.040 04

0.140.260.100.060 06

0.160.210.150.050 05

0.130.220.130.050 05

12

9 10

811

120.130.10

0.08

0.04

0.02

0.03

0 0 0 0 1/4 1/2 0 0 0 0 0 0 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/2

0 0 0 0 0

0000

0 0 1/4 0 1/3 0 1/2 0

00000

0.050.080.040.030.04

1

4

3

5 6

1100000

0.070.07000

0.040.060.0200.02

0.060.080.010.010.01

0.050.070.020.010.02

0.050.080.040.030.04

43

5 6

7

110.13

0.13

0.05

0.04

0 0 0 0 0 0 0 0 0 1/3 1/3 0 0 0

0.02

70 0 0 0 0.01 0.02

ir

ir

70.05

No pre computation / light storage

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 20

No pre‐computation / light storageSlow on‐line response O(mE)

Q1 Q2 Q3

Node 1Node 2

0.5767 0.0088 0.00880.1235 0.0076 0.0076

0 00240 0333

0.0088

5

Node 3Node 4Node 5Node 6

0.0283 0.0283 0.02830.0076 0.1235 0.00760.0088 0.5767 0.00880.0076 0.0076 0.123510

11 12

13

4

0.1260

0 02830.0024

0.00760.00240.0333

Node 7Node 8Node 9Node 10

0.0088 0.0088 0.57670.0333 0.0024 0.12600.1260 0.0024 0.03330.1260 0.0333 0.0024

10 133

62

0.1235

0.0283

0.0076

Node 11Node 12Node 13

0.0333 0.1260 0.00240.0024 0.1260 0.03330.0024 0.0333 0.1260

19 8

0.5767

0.1260 0.0333

0.0088

7

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 21

The RWR score r(i j) The RWR score r(i,j) i … query node, iQ j d i th t k j … node in the network

Combine r(i,j) to get the score r(Q,j)( ,j) g (Q,j) 1) AND query: Meeting probability prob. that |Q| random particles coincide on node jp |Q| p j

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 22

OR query: OR query: At least one particle arrives to node j:(i d j i i t t t t l t 1 d )(i.e.,  node j is important to at least 1 query node)

k‐SoftAND query:d l k d Node j is important to at least k query nodes

Can be computed recursively:

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 23

Goal:2

910Goal:

Extract a sub‐graph S on b nodes that maximizes 

4

68

9

1113

the uS r(Q, u) Idea: Iterate until budget is

1 35

4712

14 15 16 Iterate until budget is reached Pick not‐yet‐selected node j (Q j)

29

10

j=arg maxj r(Q,j) Find good paths to all query nodes 4

68

1113

Add the paths to S

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 24

1 35 712

0 4505 0.0103

5

11 0 07100.10100.1010

0.45055

11 120.00190.00460.0046

0.0103

10

11 12

13

4

0.1010

0.22670.1010

0.0710

10

12

13

4

3

0.0046

0.00240.0046

3

62

0.0710

0 1010 0 1010

0.0710

1 7

3

62

0.0019

0 0046 0 0046

0.0019

AND

1 7

9 80.4505

0.1010 0.1010

0.4505

1 7

9 80.0103

0.0046 0.0046

0.0103

S f AND

25

AND 2‐SoftAND

0 0103

5

0 16170.1617

0.0103

11 124

0 1617 0 1617

0.13870.16170.1617

10 133

0.1617

0.1387

0.08490.1617

0.1387

1 7

62

0.1617 0.1617

26

9 80.0103 0.0103

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 27

Solving the random walk for each query Solving the random walk for each query separately is time consuming Not appropriate for real time Not appropriate for real‐time

Can we do better? Can we do better?

Idea 1) Pre compute scores for each possible Idea 1) Pre‐compute scores for each possible query node i

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 28

1 2 3 4 5 6 7 8 9 10 11 12r r r r r r r r r r r r0.20 0.13 0.14 0.13 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.340.28 0.20 0.13 0.96 0.64 0.53 0.53 0.85 0.60 0.48 0.53 0.450.14 0.13 0.20 1.29 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.330.13 0.10 0.13 2.06 0.95 0.78 0.78 0.61 0.43 0.34 0.38 0.32

29 10

8120.13

0.100.08

0.04

0 02

0.03

0.09 0.09 0.09 1.27 2.41 1.97 1.97 1.05 0.73 0.58 0.66 0.560.03 0.04 0.04 0.52 0.98 2.06 1.37 0.43 0.30 0.24 0.27 0.220.03 0.04 0.04 0.52 0.98 1.37 2.06 0.43 0.30 0.24 0.27 0.220.08 0.11 0.04 0.82 1.05 0.86 0.86 2.13 1.49 1.19 1.33 1.130 03 0 04 0 03 0 28 0 36 0 30 0 30 0 74 1 78 1 00 0 76 0 79

1

43

5 6

8 110.13

0.130.05

0.02

0.04Q‐1:

0.03 0.04 0.03 0.28 0.36 0.30 0.30 0.74 1.78 1.00 0.76 0.790.04 0.04 0.04 0.34 0.44 0.36 0.36 0.89 1.50 2.45 1.54 1.800.04 0.05 0.04 0.38 0.49 0.40 0.40 1.00 1.14 1.54 2.28 1.720.02 0.03 0.02 0.21 0.28 0.22 0.22 0.56 0.79 1.20 1.14 2.05

70.13

0.05

Fast on‐line response

29

Heavy pre‐computation/storage costO(n3 ) O(n2)

[Tong‐Faloutsos‐Pan, ‘06]

Find Communities 12

9 10

812

12

9 10

812

100.04 0.039 10

12

1

43

5 6

7

11

9 1012

1

43

5 6

7

11

1

43

29 10

8 11

120.130.10

0.13

0.080.02

0.04

1

43

28

11

12 78

11

12 7

1

43

2

45 6

70.13

0.05

0.05

4

5 6

7

5 6

7

4

Combine1 3

29 10

8 11

121 3

29 10

8 11

12

30

Fix the remaining4

3

5 6

7

43

5 6

7

10

13

29 10

811

12

12

9 10

812

4

5 6

7

1

43

5 6

811

7 5

7

Within partition links cross partition links

31

Within‐partition links cross‐partition links

9 109 10

1

43

29

811

12

1

43

29

811

12

4

5 6

7

4

5 6

7

32

9 1012

9 1012

31

4

28

11

12

1

43

28

11

12

5 6

7

5 6

7

|S| << |W2|~| | | |

33

Offline stage: Offline stage:

QQ1,1

Q 1,2

=11Q

Q1,k

C i i BridgesCommunities Bridges

8/22/2006 34

Sherman Morrison Woodbury matrix identity: Sherman‐Morrison‐Woodbury matrix identity: Inverse of a rank‐k correction of matrix X can be computed by doing a rank‐k correction to X‐1:computed by doing a rank‐k correction to X :

In our case: In our case:Q‐1=c(I‐W)‐1=c(I‐W1‐W2)‐1=

1 1 11 U VQQ Q Q 8/22/2006 35

1 1 11 1

11U VcQQ Q Q

Online stage: Online stage: Compute column i of:

1 1 11 QQ Q Q

U i

1 1 11 1

11U VcQQ Q Q

Using: 

8/22/2006 36

Most slides are borrowed from the tutorial by Most slides are borrowed from the tutorial by Hanghang Tong and Christos Faloutsos from CMUCMU

http://www.cs.cmu.edu/~htong/tut/cikm2008/cikm_tutorial.html

8/22/2006 37

top related