SimRank: A Measure of Structural-Context Similarity

SimRank: A Measure of Structural-Context Similarity Glen Jeh and Jennifer Widom Stanford University ACM SIGKDD 2002 January 19, 2011 Taikyoung Kim SNU IDB Lab.


Page 1: SimRank :  A Measure of Structural-Context Similarity

SimRank: A Measure of Structural-Context Similarity

Glen Jeh and Jennifer Widom
Stanford University
ACM SIGKDD 2002

January 19, 2011
Taikyoung Kim
SNU IDB Lab.

Page 2: SimRank :  A Measure of Structural-Context Similarity

Outline
– Introduction
– Basic Graph Model
– SimRank
– Random Surfer-Pairs Model
– Conclusion
– Future Work

Page 3: SimRank :  A Measure of Structural-Context Similarity

Introduction

Many applications require a measure of “similarity” between objects
– “Find-similar-document” queries in a search engine
– Collaborative filtering in a recommender system

Page 4: SimRank :  A Measure of Structural-Context Similarity

Introduction

Propose a general approach that exploits the object-to-object relationships found in many domains
– An algorithm to compute similarity scores between nodes based on their structural context

Intuition behind the algorithm
– Similar objects are related to similar objects
– The base case is that objects are similar to themselves

“Two objects are similar if they are referenced by similar objects”

Page 5: SimRank :  A Measure of Structural-Context Similarity

Basic Graph Model

G = (V, E) [vertices, edges]
– Nodes in V: objects in the domain
– Directed edges in E: relationships between objects
– <p, q>: an edge from object p to object q

For a node v, denote:
– I(v): the set of in-neighbors of v
– O(v): the set of out-neighbors of v
– I_i(v): an individual in-neighbor (1 ≤ i ≤ |I(v)|)
– O_i(v): an individual out-neighbor (1 ≤ i ≤ |O(v)|)
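The notation above can be sketched in Python as two adjacency maps. This is a minimal illustration, not the deck's implementation: the edge list `EDGES` and the helper `build_neighbors` are hypothetical, and the edges only reuse the deck's node names (the actual edges of the deck's figure are not recoverable from the transcript).

```python
from collections import defaultdict

# Hypothetical edge list <p, q>; an assumption, not the deck's figure.
EDGES = [
    ("Univ", "ProfA"),
    ("Univ", "ProfB"),
    ("ProfA", "StudentA"),
    ("ProfB", "StudentB"),
]


def build_neighbors(edges):
    """Return (I, O): in-neighbor and out-neighbor lists for every node."""
    in_nbrs = defaultdict(list)
    out_nbrs = defaultdict(list)
    for p, q in edges:  # edge <p, q> goes from p to q
        out_nbrs[p].append(q)
        in_nbrs[q].append(p)
    return in_nbrs, out_nbrs


I, O = build_neighbors(EDGES)
print(O["Univ"])   # O(Univ): out-neighbors of Univ
print(I["ProfB"])  # I(ProfB): in-neighbors of ProfB
```

Lists keep the insertion order, so `I[v][i - 1]` plays the role of I_i(v).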

[Figure: example graph with the node sets O(Univ) and I(ProfB) marked]

Page 6: SimRank :  A Measure of Structural-Context Similarity

Outline
– Introduction
– Basic Graph Model
– SimRank
– Random Surfer-Pairs Model
– Conclusion
– Future Work

Page 7: SimRank :  A Measure of Structural-Context Similarity

SimRank

Motivation
– Two objects are similar if they are referenced by similar objects
– An object is considered maximally similar to itself (similarity score of 1)

Similar nodes: {ProfA, ProfB}, {StudentA, StudentB}, {Univ, ProfB}, …

Page 8: SimRank :  A Measure of Structural-Context Similarity

SimRank

Basic SimRank Equation

The similarity between objects a and b: s(a, b) ∈ [0, 1]
– C is a constant between 0 and 1, called the confidence level or decay factor
– C gives the rate of decay as similarity flows across edges (since C < 1)
– If a or b has no in-neighbors, s(a, b) = 0
– SimRank scores are symmetric, i.e., s(a, b) = s(b, a)

For a ≠ b, the similarity between a and b is the average similarity between the in-neighbors of a and the in-neighbors of b, scaled by C

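$$
s(a,b) =
\begin{cases}
1 & \text{if } a = b \\[6pt]
\dfrac{C}{|I(a)|\,|I(b)|} \displaystyle\sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a), I_j(b)\bigr) & \text{if } a \neq b
\end{cases}
$$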
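As a small worked case of the basic SimRank equation, suppose (an assumption made only for illustration) that Univ is the sole in-neighbor of both ProfA and ProfB, so that |I(ProfA)| = |I(ProfB)| = 1. The double sum then has a single term:

$$
s(\mathrm{ProfA}, \mathrm{ProfB})
= \frac{C}{1 \cdot 1}\, s(\mathrm{Univ}, \mathrm{Univ})
= C \cdot 1
= C
$$

With an illustrative decay factor C = 0.8, this gives s(ProfA, ProfB) = 0.8: the two professors are similar because they are referenced by the same object, and the score is discounted once for the single edge that similarity crosses.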

Page 9: SimRank :  A Measure of Structural-Context Similarity

SimRank

Basic SimRank Equation

Similarity can be thought of as “propagating” from pair to pair
– Consider the derived graph G² = (V², E²), where
  – V² = V × V; each node in V² represents a pair (a, b) of nodes in G
  – An edge from (a, b) to (c, d) exists in E² iff the edges <a, c> and <b, d> exist in G
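The definition of E² can be sketched directly: pair up every two edges of G. The function name `derived_graph` and the two toy edges are assumptions for illustration only.

```python
from itertools import product


def derived_graph(edges):
    """Edges of G^2: (a, b) -> (c, d) iff <a, c> and <b, d> are edges of G."""
    return {((a, b), (c, d)) for (a, c), (b, d) in product(edges, repeat=2)}


# Hypothetical toy edges of G (an assumption, not the deck's figure).
EDGES = [("Univ", "ProfA"), ("Univ", "ProfB")]
E2 = derived_graph(EDGES)
for src, dst in sorted(E2):
    print(src, "->", dst)
```

In this toy graph every edge of G² leaves the singleton pair (Univ, Univ), which is how similarity "propagates" from a node paired with itself to the pair (ProfA, ProfB).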

Page 10: SimRank :  A Measure of Structural-Context Similarity

SimRank

Bipartite SimRank

Bipartite domains consist of two types of objects

Example: recommender system
– People are similar if they purchase similar items
– Items are similar if they are purchased by similar people

Page 11: SimRank :  A Measure of Structural-Context Similarity

SimRank

Bipartite SimRank

Bipartite Equation
– Directed edges go from people to items
– s(A, B) denotes the similarity between persons A and B (A ≠ B)
– s(c, d) denotes the similarity between items c and d (c ≠ d)
– The similarity between persons A and B is the average similarity between the items they purchased
– The similarity between items c and d is the average similarity between the people who purchased them

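$$
s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s\bigl(O_i(A), O_j(B)\bigr)
$$

$$
s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s\bigl(I_i(c), I_j(d)\bigr)
$$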
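The two bipartite equations feed each other, so one way to evaluate them is to iterate both from the identity scores. The following is a minimal sketch under stated assumptions: the function `bipartite_simrank`, the default values C1 = C2 = 0.8 and K = 5, and the purchase data are all hypothetical.

```python
from itertools import product


def bipartite_simrank(purchases, C1=0.8, C2=0.8, K=5):
    """purchases: (person, item) edges, directed from person to item."""
    people = sorted({p for p, _ in purchases})
    items = sorted({i for _, i in purchases})
    bought = {p: [i for q, i in purchases if q == p] for p in people}  # O(A)
    buyers = {i: [p for p, j in purchases if j == i] for i in items}   # I(c)
    # Start from the identity: an object is only similar to itself.
    sp = {(a, b): float(a == b) for a, b in product(people, repeat=2)}
    si = {(c, d): float(c == d) for c, d in product(items, repeat=2)}
    for _ in range(K):
        new_sp, new_si = {}, {}
        for a, b in sp:
            if a == b:
                new_sp[a, b] = 1.0
            else:  # average similarity of the items A and B purchased
                total = sum(si[x, y] for x in bought[a] for y in bought[b])
                new_sp[a, b] = C1 * total / (len(bought[a]) * len(bought[b]))
        for c, d in si:
            if c == d:
                new_si[c, d] = 1.0
            else:  # average similarity of the people who purchased c and d
                total = sum(sp[x, y] for x in buyers[c] for y in buyers[d])
                new_si[c, d] = C2 * total / (len(buyers[c]) * len(buyers[d]))
        sp, si = new_sp, new_si
    return sp, si


# Hypothetical purchase data, for illustration only.
PURCHASES = [("A", "sugar"), ("A", "frosting"), ("A", "eggs"),
             ("B", "frosting"), ("B", "eggs"), ("B", "flour")]
person_sim, item_sim = bipartite_simrank(PURCHASES)
print(round(person_sim["A", "B"], 3), round(item_sim["sugar", "flour"], 3))
```

Both score tables are updated from the previous iteration's values, so people scores and item scores reinforce each other one step at a time.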

Page 12: SimRank :  A Measure of Structural-Context Similarity

SimRank

Computing SimRank - Naïve Method

R_k(a, b) gives the score between a and b on iteration k

The values R_k(*, *) are non-decreasing as k increases

In experiments, R_k converged rapidly (by K = 5 iterations)

Complexity (n is the number of nodes)
– Space: O(n²) to store the results R_k
– Time: O(K n² d_2), where d_2 is the average of |I(a)||I(b)| over all node pairs (a, b)

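$$
R_0(a,b) =
\begin{cases}
0 & \text{if } a \neq b \\
1 & \text{if } a = b
\end{cases}
$$

$$
R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k\bigl(I_i(a), I_j(b)\bigr) \quad (a \neq b), \qquad R_{k+1}(a,a) = 1
$$

$$
\lim_{k \to \infty} R_k(a,b) = s(a,b)
$$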
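The naïve iteration can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `simrank`, the defaults C = 0.8 and K = 5, and the edge list are assumptions.

```python
from collections import defaultdict
from itertools import product


def simrank(edges, C=0.8, K=5):
    """Naive SimRank: K rounds of R_{k+1} computed from R_k over all pairs."""
    nodes = sorted({v for edge in edges for v in edge})
    in_nbrs = defaultdict(list)
    for p, q in edges:
        in_nbrs[q].append(p)
    # R_0(a, b) = 1 if a == b else 0
    R = {(a, b): float(a == b) for a, b in product(nodes, repeat=2)}
    for _ in range(K):
        nxt = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                nxt[a, b] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                nxt[a, b] = 0.0  # no in-neighbors: score is defined as 0
            else:
                total = sum(R[i, j] for i in in_nbrs[a] for j in in_nbrs[b])
                nxt[a, b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        R = nxt
    return R


# Hypothetical edges reusing the deck's node names (an assumption).
EDGES = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]
R = simrank(EDGES)
print(round(R["ProfA", "ProfB"], 3))        # 0.8
print(round(R["StudentA", "StudentB"], 3))  # 0.64
```

The dictionary holds all n² pairs, and each pair sums over |I(a)||I(b)| in-neighbor pairs per iteration, which matches the O(n²) space and O(K n² d_2) time stated above.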

Page 13: SimRank :  A Measure of Structural-Context Similarity

SimRank

Computing SimRank - Pruning

Pruning the logical graph G²
– In the naïve method, all n² nodes of G² are considered, so a similarity score is computed for every node pair
– Nodes far from a node v tend to have lower similarity scores with v than nodes near v

Pruning
– Set the similarity between two nodes that are far apart to 0
– Consider only node pairs whose nodes are near each other, within a radius r
– Complexity
  – Space: O(n d_r), where d_r is the average number of nodes within radius r of a node
  – Time: O(K n d_r d_2)
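The pair selection behind pruning can be sketched as a bounded breadth-first search. The deck does not spell out how "near" is measured; this sketch assumes hop distance in the graph with edge directions ignored, and the helper name `nodes_within_radius` and the chain graph are hypothetical.

```python
from collections import defaultdict, deque


def nodes_within_radius(edges, r):
    """Map each node to the set of other nodes at most r hops away.

    Assumption: distance is measured with edge directions ignored.
    """
    adj = defaultdict(set)
    for p, q in edges:
        adj[p].add(q)
        adj[q].add(p)
    near = {}
    for start in list(adj):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            if dist[v] == r:
                continue  # do not expand beyond the radius
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        near[start] = set(dist) - {start}
    return near


# Hypothetical chain a -> b -> c -> d -> e, for illustration only.
CHAIN = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
near = nodes_within_radius(CHAIN, r=1)
kept_pairs = sum(len(others) for others in near.values()) // 2
print(kept_pairs, "of", 5 * 4 // 2, "unordered pairs kept")
```

Only the kept pairs need a stored score, which is where the O(n d_r) space bound comes from; every other pair is treated as having similarity 0.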

Page 14: SimRank :  A Measure of Structural-Context Similarity

Outline
– Introduction
– Basic Graph Model
– SimRank
– Random Surfer-Pairs Model
– Conclusion
– Future Work

Page 15: SimRank :  A Measure of Structural-Context Similarity

Random Surfer-Pairs Model

To give intuition for the similarity scores, provide an intuitive model
– Based on “random surfers”
– Show that the SimRank score s(a, b) measures how soon two random surfers are expected to meet at the same node

Expected Distance
– u and v are nodes in a strongly connected graph
– The expected distance (ED) from u to v is exactly the expected number of steps a random surfer would take before first reaching v, starting from u
– Tour t = <w_1, …, w_k>
– l[t]: length of t
– P[t]: probability of traveling t

$$
d(u,v) = \sum_{t:\, u \rightsquigarrow v} P[t]\, l[t]
$$
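The sum over tours can be approximated numerically by tracking the probability that the surfer has not yet reached v. This is a sketch under assumptions not stated in the deck: the surfer picks an out-edge uniformly at random at each step, tours are truncated at `max_len` steps, and the function name and the three-node graph are hypothetical.

```python
from collections import defaultdict


def expected_distance(edges, u, v, max_len=200):
    """Approximate d(u, v) as the sum of P[t] * l[t] over tours of length
    at most max_len that first reach v at their last step. Assumes u != v."""
    out_nbrs = defaultdict(list)
    for p, q in edges:
        out_nbrs[p].append(q)
    mass = {u: 1.0}  # probability of being at each node without having hit v
    total = 0.0
    for length in range(1, max_len + 1):
        nxt = defaultdict(float)
        for x, prob in mass.items():
            for w in out_nbrs[x]:
                step = prob / len(out_nbrs[x])  # uniform choice of out-edge
                if w == v:
                    total += step * length  # a tour ends here: P[t] * l[t]
                else:
                    nxt[w] += step
        mass = nxt
    return total


# Hypothetical strongly connected graph: u <-> v and u <-> w.
EDGES = [("u", "v"), ("u", "w"), ("w", "u"), ("v", "u")]
print(round(expected_distance(EDGES, "u", "v"), 6))  # 3.0
```

In this graph the surfer reaches v directly with probability 1/2 and otherwise spends two steps on a detour through w before trying again, so the series sums to 3.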

Page 16: SimRank :  A Measure of Structural-Context Similarity

Random Surfer-Pairs Model

Expected Meeting Distance (EMD)
– EMD is symmetric
– The EMD m(a, b) is simply the expected distance in G² from (a, b) to any singleton node (x, x) ∈ V²

$$
m(a,b) = \sum_{t:\, (a,b) \rightsquigarrow (x,x)} P[t]\, l[t]
$$

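[Figure: example graphs annotated with EMD values. First graph: m(v, w) = 1, m(u, v) = ∞, m(u, w) = ∞. Further graphs: m(*, *) = ∞ and m(*, *) = 3]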

Page 17: SimRank :  A Measure of Structural-Context Similarity

Random Surfer-Pairs Model

Expected-f Meeting Distance
– Our approach to circumventing the “infinite EMD” problem
– Map all distances to a finite interval: instead of computing the expected length l(t) of a tour, compute the expected value of c^l(t), where c is a constant in (0, 1)

Equivalence to SimRank
– s'(*, *) exactly models our original definition of SimRank scores

$$
s'(a,b) = \sum_{t:\, (a,b) \rightsquigarrow (x,x)} P[t]\, c^{l(t)}
$$
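The expected-f meeting distance can be evaluated by pushing probability mass through pairs of nodes until the two surfers meet. This sketch rests on assumptions that the transcript does not state: both surfers move in lockstep along edges in the backward direction (from a node to one of its in-neighbors), each choice is uniform, and tours are truncated at `max_len` steps. The function name and edge list are hypothetical.

```python
from collections import defaultdict


def expected_f_meeting(edges, a, b, c=0.8, max_len=50):
    """Approximate s'(a, b): the sum of P[t] * c**l(t) over pair tours
    from (a, b) that end the first time both surfers share a node."""
    in_nbrs = defaultdict(list)
    for p, q in edges:
        in_nbrs[q].append(p)
    if a == b:
        return 1.0  # tour of length 0 contributes c**0
    mass = {(a, b): 1.0}  # probability of each pair that has not met yet
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = defaultdict(float)
        for (x, y), prob in mass.items():
            ix, iy = in_nbrs[x], in_nbrs[y]
            if not ix or not iy:
                continue  # a surfer cannot move, so this tour never meets
            step = prob / (len(ix) * len(iy))
            for u in ix:
                for v in iy:
                    if u == v:
                        score += step * c ** length  # surfers meet at (u, u)
                    else:
                        nxt[u, v] += step
        mass = nxt
    return score


# Hypothetical edges reusing the deck's node names (an assumption).
EDGES = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]
print(round(expected_f_meeting(EDGES, "ProfA", "ProfB"), 3))        # 0.8
print(round(expected_f_meeting(EDGES, "StudentA", "StudentB"), 3))  # 0.64
```

On this toy graph the surfers starting at ProfA and ProfB meet at Univ after one step (score c), and those starting at StudentA and StudentB meet after two (score c²). These are the values the basic SimRank recurrence gives for the same pairs with C = c.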

Page 18: SimRank :  A Measure of Structural-Context Similarity

Outline
– Introduction
– Basic Graph Model
– SimRank
– Random Surfer-Pairs Model
– Conclusion
– Future Work

Page 19: SimRank :  A Measure of Structural-Context Similarity

Conclusion

Main contributions
– A formal definition of SimRank similarity scoring over arbitrary graphs, several useful derivatives of SimRank, and an algorithm to compute SimRank
– A graph-theoretic model for SimRank that gives intuitive mathematical insight into its use and computation
– Experimental results, using an in-memory implementation of SimRank over two real data sets, that show the effectiveness and feasibility of SimRank

Page 20: SimRank :  A Measure of Structural-Context Similarity

Future Work

Address efficiency and scalability issues
– Including additional pruning heuristics and disk-based algorithms

Consider ternary (or more) relationships in computing structural-context similarity

Explore the combination of SimRank with other domain-specific similarity measures