[icde 2012] on top-k structural similarity search

On Top-k Structural Similarity Search (ICDE 2012)
Pei Lee, Laks V.S. Lakshmanan (University of British Columbia, Vancouver, BC, Canada)
Jeffrey Xu Yu (Chinese University of Hong Kong, Hong Kong, China)

Upload: pei-lee

Post on 30-Jun-2015


DESCRIPTION

This talk addresses the following classic problem: given a node in a graph, how can we efficiently find the top-k nodes most similar to it by checking only the graph's link structure? The talk accompanies the ICDE 2012 paper "On Top-k Structural Similarity Search", which can be found at http://www.cs.ubc.ca/~peil/research.html

TRANSCRIPT

Page 1: [ICDE 2012] On Top-k Structural Similarity Search

On Top-k Structural Similarity Search

Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada

Jeffrey Xu Yu Chinese University of Hong Kong Hong Kong, China

Page 2: [ICDE 2012] On Top-k Structural Similarity Search

Outline

- Problem definition
  - Structural similarity
  - Top-k structural similarity search
- Existing top-k structural similarity search methods
  - SimRank, P-Rank
  - Constraints
- TopSim: a family of efficient top-k structural similarity search algorithms with accuracy guarantees
  - Truncated TopSim, Prioritized TopSim
- Experiments

Page 3: [ICDE 2012] On Top-k Structural Similarity Search

Graph structures are ubiquitous

Social networks, citation networks, web graphs, etc.

Page 4: [ICDE 2012] On Top-k Structural Similarity Search

What's structural similarity?

Structural similarity: the pairwise similarity between nodes in a graph.

Applications: link prediction, recommendation, etc.

Intuition: two nodes are similar if their neighbors are similar. This is derived from PageRank's intuition: a node is important if it is referenced by many other important nodes.

[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v]

How can we quantify the similarity between nodes u and v?

Problem definition:

Input: $G(V, E)$, $u \in V$, $v \in V$

Output: $S(u, v)$

Page 5: [ICDE 2012] On Top-k Structural Similarity Search

What's top-k structural similarity search?

Given a node v in a huge graph, find the top-k nodes most similar to v. But:

- We definitely do not want to compare v with every node.
- The accuracy of the results should be guaranteed.

Problem definition:

Input: $G(V, E)$, $v \in V$, $k$

Output: the top-$k$ similar nodes for $v$

Page 6: [ICDE 2012] On Top-k Structural Similarity Search

Existing Structural Similarity Measures

- Neighbor-based approaches
  - Jaccard coefficient, cosine similarity, Pearson correlation, co-citation, etc.
  - Cons: no common neighbors, no similarity!
- Random-walk-based approaches
  - SimRank (Jeh & Widom, KDD'02)
  - P-Rank (Zhao et al., CIKM'09), which extends SimRank
  - Cons: high computational cost; not designed for top-k similarity search
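The "no common neighbors, no similarity" drawback can be illustrated with a minimal sketch. The toy graph below (a points to b and c, b points to u, c points to v) is an assumption made for illustration, and `jaccard` is a hypothetical helper name, not something from the talk.

```python
def jaccard(neighbors, u, v):
    """Jaccard coefficient of the neighbor sets of u and v."""
    a, b = set(neighbors[u]), set(neighbors[v])
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical toy graph, stored as in-neighbor lists.
in_neighbors = {"a": [], "b": ["a"], "c": ["a"], "u": ["b"], "v": ["c"]}

sim_bc = jaccard(in_neighbors, "b", "c")  # b and c share the in-neighbor a
sim_uv = jaccard(in_neighbors, "u", "v")  # u and v share no in-neighbor
```

Here u and v get similarity 0 even though their in-neighbors b and c are themselves similar, which is the gap that random-walk-based measures aim to close.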

Page 7: [ICDE 2012] On Top-k Structural Similarity Search

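SimRank & P-Rank

SimRank: two nodes are similar if they are referenced by similar nodes.

[Figure: example graph with nodes a, b, c, u, v, annotated with $S(a, a) = 1$, $S(b, c) = 0.5 > 0$ and $S(u, v) = 0.25 > 0$]

Pairwise iterative form, where $I(u)$ denotes the in-neighbors of $u$:

$$S_{n+1}(u, v) = \frac{C}{|I(u)|\,|I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j)$$

Matrix form (labels on the slide: transition matrix $W$, correction matrix):

$$S_{n+1} = C\, W S_n W^{T}$$

P-Rank: two nodes are similar if they are related to similar nodes. The first term is SimRank and the second term is reversed SimRank, where $O(u)$ denotes the out-neighbors of $u$:

$$S_{n+1}(u, v) = \frac{\lambda C}{|I(u)|\,|I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j) + \frac{(1 - \lambda)\, C}{|O(u)|\,|O(v)|} \sum_{i \in O(u)} \sum_{j \in O(v)} S_n(i, j)$$

with $0 < C < 1$ and $0 < \lambda < 1$.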
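The pairwise iterative form of SimRank can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the function name `simrank` is hypothetical, and the toy edges (a to b, a to c, b to u, c to v) are an assumption chosen so that C = 0.5 gives the values shown on the slide.

```python
from itertools import product

def simrank(in_nbrs, C=0.5, iters=3):
    """Naive pairwise SimRank iteration over all node pairs."""
    nodes = list(in_nbrs)
    # S_0: a node is fully similar to itself, and to nothing else.
    S = {(x, y): 1.0 if x == y else 0.0 for x, y in product(nodes, nodes)}
    for _ in range(iters):
        nxt = {}
        for x, y in product(nodes, nodes):
            if x == y:
                nxt[(x, y)] = 1.0
            elif in_nbrs[x] and in_nbrs[y]:
                total = sum(S[(i, j)] for i in in_nbrs[x] for j in in_nbrs[y])
                nxt[(x, y)] = C * total / (len(in_nbrs[x]) * len(in_nbrs[y]))
            else:
                nxt[(x, y)] = 0.0  # no in-neighbors, no evidence of similarity
        S = nxt
    return S

# Assumed toy graph, stored as in-neighbor lists.
in_nbrs = {"a": [], "b": ["a"], "c": ["a"], "u": ["b"], "v": ["c"]}
S = simrank(in_nbrs)
```

Every iteration touches all |V|-by-|V| pairs, which is the cost that a local top-k search tries to avoid.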

Page 8: [ICDE 2012] On Top-k Structural Similarity Search

Top-k similarity search: challenges

- Matrix-based approach (KDD'02, VLDB'08)
  - Offline: compute a |V|-by-|V| similarity matrix
  - SimRank/P-Rank takes $O(|E|^2)$ time, which degenerates to $O(|V|^4)$ in the worst case
  - Space cost: hard to store this huge similarity matrix
- Vector-based approach (SDM'10)
  - Offline: compute a vector of length |V|
  - Takes O(|V|D2n) time in the worst case, where n is the number of iterations and D is the average degree
- All these approaches need to access the whole graph to find the exact top-k similar nodes

Page 9: [ICDE 2012] On Top-k Structural Similarity Search

Contributions

- Transform the computation of pairwise similarity on graph G into the computation of authority on G×G, based on a propagation & aggregation process.
- Propose TopSim, a local top-k structural similarity search algorithm that avoids accessing the whole graph while guaranteeing accuracy.
- Propose Trun-TopSim-SM and Prio-TopSim-SM, two approximations that allow us to trade accuracy for speed.

Page 10: [ICDE 2012] On Top-k Structural Similarity Search

How TopSim works

Coupling random walk on G

Single random walk on G×G

Propagation & Aggregation

Similarity Path

Similarity Score

Page 11: [ICDE 2012] On Top-k Structural Similarity Search

Product of graphs: G×G

Given G(V, E), G×G is defined as follows:

- For nodes u and v in G, uv is a node in G×G
- For edges (e, u) and (e, v) in G, (ee, uv) is an edge in G×G

[Figure: graph G with nodes a, b, c, d, e, u, v, and its product graph G×G with nodes uv, ce, ee, bd, uu, vu, vv, dd, cb, aa, da, ec, cc, ae, ea]

- Each node pair in G is a node in G×G
- Each edge pair in G is an edge in G×G
- No need to materialize G×G: it exists only conceptually to facilitate analysis
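Since G×G never needs to be materialized, its edges can be generated on demand from G. The sketch below assumes in-neighbor lists for G; the helper name `product_in_neighbors` and the exact edge set are assumptions, because the figure does not survive in the transcript.

```python
from itertools import product

def product_in_neighbors(in_nbrs, pair):
    """In-neighbors of node `pair` in G×G, generated on demand from G."""
    u, v = pair
    return list(product(in_nbrs[u], in_nbrs[v]))

# Assumed edges of G, stored as in-neighbor lists (e.g. e -> u and e -> v).
in_nbrs = {
    "a": [], "b": ["a"], "c": ["b"], "d": ["a", "c"],
    "e": ["d"], "u": ["c", "e"], "v": ["e"],
}
preds_uv = product_in_neighbors(in_nbrs, ("u", "v"))
```

Only the pairs actually reached during a search are ever created, so memory use follows the explored neighborhood rather than |V|².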

Page 12: [ICDE 2012] On Top-k Structural Similarity Search

Coupling random walk

- Coupling random walk: two random surfers, Surf1 and Surf2, walk simultaneously and follow the same edge direction
- A coupling random walk on G can be equivalently transformed into a single random walk on G×G
- SimRank: S(u, v) is the first-meeting probability of two random surfers starting from u and v respectively and following backward links

[Figure: graph G and its product graph G×G]

Page 13: [ICDE 2012] On Top-k Structural Similarity Search

Compute similarity based on coupling random walk

- We transform a similarity ranking problem on G into an authority ranking problem on G×G: R(uv) = S(u, v)
- Initialization:
  - Source node (u = v): R(uv) = 1 is fixed
  - Target node (u ≠ v): R(uv) = 0, and R(uv) will be updated
- How is R(uv) updated? By a propagation & aggregation process on G×G
  - Propagation: nodes propagate their authority to their neighbors following random walk steps
  - Aggregation: nodes receive and aggregate the authorities that are propagated in from their neighbors
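One way to picture the aggregation step is a pull-style recursion on G×G: the authority of a target node uv is the decayed average of the authorities of its in-neighbor pairs, with source nodes fixed at 1. This is a sketch under assumptions: the function name `authority`, the step limit and the toy graph are all illustrative, not taken from the paper.

```python
from functools import lru_cache

def authority(in_nbrs, u, v, steps, C=0.5):
    """R(uv) after `steps` rounds, touching only pairs reachable backward from uv."""
    @lru_cache(maxsize=None)
    def R(x, y, n):
        if x == y:
            return 1.0  # source node: authority is fixed at 1
        if n == 0 or not in_nbrs[x] or not in_nbrs[y]:
            return 0.0  # target node with nothing propagated in yet
        # Aggregation: collect what the in-neighbor pairs propagate to (x, y).
        total = sum(R(i, j, n - 1) for i in in_nbrs[x] for j in in_nbrs[y])
        return C * total / (len(in_nbrs[x]) * len(in_nbrs[y]))
    return R(u, v, steps)

# Assumed toy graph, stored as in-neighbor lists.
in_nbrs = {
    "a": [], "b": ["a"], "c": ["b"], "d": ["a", "c"],
    "e": ["d"], "u": ["c", "e"], "v": ["e"],
}
r_uv = authority(in_nbrs, "u", "v", steps=3)
```

The memoization keeps each (pair, remaining steps) state from being recomputed when several paths lead through it.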

Page 14: [ICDE 2012] On Top-k Structural Similarity Search

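Compute S(u,v)?

- Similarity path: a path in G×G from a source node to a target node that does not pass through any other source node
- Probability of a transition step from ij to uv: $P(ij, uv) = \frac{1}{|I(u)|\,|I(v)|}$
- Similarity: the sum over all similarity paths $w$ with end node uv, where $l(w)$ is the length of $w$: $S(u, v) = \sum_{w} P(w)\, C^{l(w)}$

[Figure: product graph G×G]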

Page 15: [ICDE 2012] On Top-k Structural Similarity Search

Compute S(u,v): example

[Figure: product graph G×G with the similarity paths ending at uv]

With C = 0.5, if we only consider 3 steps:

- Path 1: (ee, uv), with P(ee, uv) = 0.5
- Path 2: (aa, bd, ce, uv), with P(aa, bd, ce, uv) = 0.5 × 1 × 0.5 = 0.25

$$S_3(u, v) = P(ee, uv) \cdot C + P(aa, bd, ce, uv) \cdot C^3 = 0.5 \cdot 0.5 + 0.25 \cdot 0.5^3 \approx 0.28$$
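The example can be checked by enumerating similarity paths backward from uv. The sketch below assumes an edge set for G that reproduces the two paths and probabilities on the slide (the original figure is not available), and `similarity_paths` is a hypothetical helper name.

```python
def similarity_paths(in_nbrs, u, v, steps):
    """Yield (path, probability) for similarity paths ending at uv, up to `steps` steps."""
    def walk(pair, path, prob, depth):
        x, y = pair
        if x == y:  # reached a source node: the path is complete
            yield tuple(path), prob
            return
        if depth == steps or not in_nbrs[x] or not in_nbrs[y]:
            return
        step_prob = 1.0 / (len(in_nbrs[x]) * len(in_nbrs[y]))
        for i in in_nbrs[x]:
            for j in in_nbrs[y]:
                yield from walk((i, j), [i + j] + path, prob * step_prob, depth + 1)
    yield from walk((u, v), [u + v], 1.0, 0)

# Assumed edges of G, stored as in-neighbor lists.
in_nbrs = {
    "a": [], "b": ["a"], "c": ["b"], "d": ["a", "c"],
    "e": ["d"], "u": ["c", "e"], "v": ["e"],
}
C = 0.5
paths = dict(similarity_paths(in_nbrs, "u", "v", steps=3))
# Each path contributes its probability times C raised to its number of steps.
s3 = sum(prob * C ** (len(path) - 1) for path, prob in paths.items())
```

With this graph, `paths` holds the two paths from the slide and `s3` evaluates to 0.28125, which rounds to the 0.28 shown.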

Page 16: [ICDE 2012] On Top-k Structural Similarity Search

TopSim

Page 17: [ICDE 2012] On Top-k Structural Similarity Search

Optimization based on SimMap

Observation: many similarity paths overlap.

[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v, arranged in levels 0 to 3, with similarity paths highlighted]

SimMap: SM(u) = {(key, value)}

- key is the node visited by Surf2 at step i when Surf1 visits node u
- value = $S_i(key, u)$

SM(v) is exactly the result list. This gives TopSim-SM.

Example: start from c

- SM(b) = {(d, 1/2), (f, 1/2)}
- SM(a) = {(e, 1/8)}
- SM(v) = {(u, 1/32)}
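The SimMap example can be reproduced by moving Surf1 along the path c, b, a, v and pushing Surf2's accumulated scores one step forward each time. This is a sketch under assumptions: the edge set below is a guess chosen to match the values on the slide with C = 0.5, and `simmaps_along_path` is a hypothetical helper.

```python
from collections import defaultdict

def simmaps_along_path(in_nbrs, path, C=0.5):
    """SimMaps built while Surf1 follows `path` forward from its first node."""
    out_nbrs = defaultdict(list)
    for node, preds in in_nbrs.items():
        for p in preds:
            out_nbrs[p].append(node)
    sm = {path[0]: 1.0}  # both surfers start at the source node
    maps = []
    for target in path[1:]:
        nxt_map = {}
        for key, value in sm.items():
            for nxt in out_nbrs[key]:
                if nxt == target:
                    continue  # surfers would meet again: not a similarity path
                w = value * C / (len(in_nbrs[target]) * len(in_nbrs[nxt]))
                nxt_map[nxt] = nxt_map.get(nxt, 0.0) + w
        maps.append((target, nxt_map))
        sm = nxt_map
    return maps

# Assumed edges, stored as in-neighbor lists.
in_nbrs = {
    "a": ["b", "h"], "b": ["c"], "c": [], "d": ["c"], "e": ["d", "f"],
    "f": ["c"], "g": [], "h": [], "u": ["e"], "v": ["a", "g"],
}
maps = simmaps_along_path(in_nbrs, ["c", "b", "a", "v"])
```

The two overlapping paths through d and f are merged into the single entry for e in SM(a), which is the sharing that the SimMap optimization exploits.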

Page 18: [ICDE 2012] On Top-k Structural Similarity Search

Family of TopSim Algorithms

| Algorithm | Quality | Performance |
|---|---|---|
| TopSim | Exact | Slow if the graph is not sparse |
| TopSim-SM | Exact | More efficient than TopSim |
| Trun-TopSim-SM | Trade accuracy for speed | More efficient than TopSim-SM |
| Prio-TopSim-SM | Trade accuracy for speed | More efficient than TopSim-SM |

Page 19: [ICDE 2012] On Top-k Structural Similarity Search

TopSim approximations for scale-free graphs

- Scale-free graphs
  - Some nodes have very high degree
  - Web graphs, citation networks, etc.
- Random surfers will be trapped by high-degree nodes
- The size of SimMaps will explode
- Revisit the transition probabilities

[Figure: transition probabilities, with labels a and n]

Page 20: [ICDE 2012] On Top-k Structural Similarity Search

TopSim approximations

Basic idea: only consider similarity paths with higher probability.

- Truncated TopSim-SM: if $P(u_0u_0, \ldots, u_iv_i) < \eta$, stop and ignore this path
- Prioritized TopSim-SM: set a buffer size H for each SimMap and only expand the top H nodes in SimMaps: if |SM(u)| > H, keep only the top H entries so that |SM(u)| = H

Find the accuracy and complexity analysis in the paper.
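Both approximations amount to pruning before expanding further. The helpers below are a rough sketch with hypothetical names. Note one simplification: `truncate` applies the threshold to SimMap values rather than to the raw path probability named on the slide, so it only approximates the truncation rule.

```python
import heapq

def prioritize(sim_map, H):
    """Keep only the H highest-scoring entries of a SimMap."""
    if len(sim_map) <= H:
        return dict(sim_map)
    return dict(heapq.nlargest(H, sim_map.items(), key=lambda kv: kv[1]))

def truncate(sim_map, eta):
    """Drop SimMap entries whose score is below eta (simplified threshold rule)."""
    return {key: value for key, value in sim_map.items() if value >= eta}

# Hypothetical SimMap contents for illustration.
sm = {"d": 0.5, "f": 0.25, "g": 0.1}
top2 = prioritize(sm, H=2)
kept = truncate(sm, eta=0.2)
```

Applying either helper after each step bounds the SimMap size, at the cost of dropping low-scoring candidates that might have accumulated more score later.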

Page 21: [ICDE 2012] On Top-k Structural Similarity Search

Experiments

- Datasets
  - Arxiv High Energy Physics paper citation network, with 34,546 nodes and 421,578 edges
  - DBLP co-author graph, with 0.92M nodes and 6.1M edges
  - DBLP citation network, with 1.5M papers and 2.1M citations
  - LiveJournal social network, with 4.84M users and 68.99M friendship ties
- Factors
  - C = 0.5, η = 0.001, H = 100

Page 22: [ICDE 2012] On Top-k Structural Similarity Search

Accuracy of similarity scores

[Figure: accuracy ratio and accuracy loss, running on the Arxiv citation network]

3 steps/iterations are good enough for the accuracy of the top-20 list

Page 23: [ICDE 2012] On Top-k Structural Similarity Search

Precision@k

(Running on DBLP citation network)

- k around 20~30 yields the highest precision
- 3 steps/iterations yield a high precision
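Precision@k can be read as the fraction of the approximate top-k list that also appears in the exact top-k list. The sketch below uses that common definition; whether the paper normalizes in exactly this way is an assumption, and the rankings are made up for illustration.

```python
def precision_at_k(approx_ranking, exact_ranking, k):
    """Fraction of the approximate top-k that also appears in the exact top-k."""
    top_approx = set(approx_ranking[:k])
    top_exact = set(exact_ranking[:k])
    return len(top_approx & top_exact) / k

# Hypothetical rankings, most similar node first.
approx = ["u", "d", "b", "x"]
exact = ["u", "b", "c", "d"]
p3 = precision_at_k(approx, exact, k=3)
```

This measure ignores the order inside the top-k list, which is why a rank-aware measure is reported separately.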

Page 24: [ICDE 2012] On Top-k Structural Similarity Search

Kendall Tau distance (cares more about the ranking order)

[Figure: a pair (a, b) ranked in the same order in both lists is concordant; a pair ranked in opposite orders is discordant]

The higher, the better
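The concordant and discordant pair counts combine into a rank correlation score. The sketch below computes the Kendall tau coefficient, (concordant − discordant) / total pairs, which fits the "higher is better" reading; that the talk uses this exact normalization is an assumption.

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    """Kendall tau coefficient over the items that both rankings contain."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    common = [item for item in ranking_a if item in pos_b]
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        # Same relative order in both rankings means the pair is concordant.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0

tau_same = kendall_tau(["a", "b", "c"], ["a", "b", "c"])
tau_swap = kendall_tau(["a", "b", "c"], ["b", "a", "c"])
tau_reversed = kendall_tau(["a", "b", "c"], ["c", "b", "a"])
```

Identical rankings score 1, fully reversed rankings score -1, and a single swapped pair out of three gives 1/3.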

Page 25: [ICDE 2012] On Top-k Structural Similarity Search

Kendall Tau distance (care more about the ranking order …)

k around 20~30 yields the highest precision

3 steps/iterations yields a high precision

Page 26: [ICDE 2012] On Top-k Structural Similarity Search

Running time with different node sizes and node degrees

TopSim algorithms are not very sensitive to the graph size

TopSim approximations can handle high degree nodes

Page 27: [ICDE 2012] On Top-k Structural Similarity Search

Running time and accessed nodes

Page 28: [ICDE 2012] On Top-k Structural Similarity Search

Excitements

- We transform a similarity problem on graph G into an equivalent authority ranking problem on the product graph G×G to facilitate analysis.
- We propose a family of TopSim algorithms that:
  - Produce top-k results with an accuracy guarantee
  - Only access a small portion of the graph
  - Handle both SimRank and P-Rank under the same top-k framework

[Figure: TopSim covering both SimRank and P-Rank]

Questions?

Page 29: [ICDE 2012] On Top-k Structural Similarity Search

TopSim-SM

- Start from v and find source nodes at each step, from level n-1 to 0
- Let Surf1 start from a source node and walk to node v
- Let Surf2 start from the same source node and put the visited nodes into SimMaps
- When Surf1 visits v, Surf2 visits exactly the nodes similar to v in the same step

[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v, arranged in levels 0 to 3]

Example: start from c

- SM(b) = {(d, 1/2), (f, 1/2)}
- SM(a) = {(e, 1/8)}
- SM(v) = {(u, 1/32)}
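The steps above can be sketched as a level-by-level procedure: collect the nodes from which Surf1 can reach v in exactly r steps, then merge SimMaps from the farthest level down to v. This is one interpretation, not the paper's pseudocode; the function name `topsim_sm`, the data layout and the example edges are assumptions.

```python
from collections import defaultdict

def topsim_sm(in_nbrs, v, steps, C=0.5):
    """Similarity scores S_steps(x, v) for the nodes x reached by Surf2."""
    out_nbrs = defaultdict(list)
    for node, preds in in_nbrs.items():
        for p in preds:
            out_nbrs[p].append(node)
    # levels[r]: nodes from which Surf1 reaches v in exactly r forward steps.
    levels = [{v}]
    for _ in range(steps):
        levels.append({p for w in levels[-1] for p in in_nbrs[w]})
    sim_maps = {}  # (node, remaining steps) -> merged SimMap
    for r in range(steps, -1, -1):
        for w in levels[r]:
            sm = defaultdict(float)
            sm[w] = 1.0  # w itself may be a source node: both surfers start here
            if r < steps:
                for prev in in_nbrs[w]:  # Surf1 moves prev -> w
                    for key, value in sim_maps[(prev, r + 1)].items():
                        for nxt in out_nbrs[key]:  # Surf2 moves key -> nxt
                            if nxt != w:  # skip meetings after the source
                                sm[nxt] += value * C / (len(in_nbrs[w]) * len(in_nbrs[nxt]))
            sim_maps[(w, r)] = sm
    result = dict(sim_maps[(v, 0)])
    del result[v]  # drop the trivial entry for v itself
    return result

# Assumed edges, stored as in-neighbor lists.
in_nbrs = {
    "a": ["b", "h"], "b": ["c"], "c": [], "d": ["c"], "e": ["d", "f"],
    "f": ["c"], "g": [], "h": [], "u": ["e"], "v": ["a", "g"],
}
scores = topsim_sm(in_nbrs, "v", steps=3)
top_k = sorted(scores, key=scores.get, reverse=True)[:1]
```

Only nodes within `steps` hops of v, plus the nodes Surf2 reaches from them, are ever touched, which is what keeps the search local.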