P-Rank: A Comprehensive Structural Similarity Measure over Information Networks

P-Rank: A Comprehensive Structural Similarity Measure over Information Networks. CIKM’09, November 3rd, 2009, Hong Kong. Peixiang Zhao, Jiawei Han, Yizhou Sun, University of Illinois at Urbana-Champaign. Presented by Prof. Hong Cheng, CUHK.


Page 1: P-Rank: A Comprehensive Structural Similarity Measure over Information Networks

P-Rank: A Comprehensive Structural Similarity Measure over Information Networks

CIKM’09, November 3rd, 2009, Hong Kong

Peixiang Zhao, Jiawei Han, Yizhou Sun
University of Illinois at Urbana-Champaign

Presented by Prof. Hong Cheng, CUHK


Outline
• Introduction & Motivation
• P-Rank
  – Formula
  – Derivatives
  – Computation
• Experimental Studies
• Future direction & Conclusion


Introduction
• Information Networks (INs)
  – Physical, conceptual, and human/societal entities
  – Interconnected relationships among different entities
• INs are ubiquitous and form a critical component of modern information infrastructure
  – The Web
  – Highway or urban transportation networks
  – Research collaboration and publication networks
  – Biological networks
  – Social networks


Problem
• Similarity computation on entities of INs
  – How similar is webpage A to webpage B on the Web?
  – How similar is researcher A to researcher B in the DBLP co-authorship network?
• First of all, how should “similarity” be defined within a massive IN?
  – Textual proximity of entity labels/contents
  – Structural proximity conveyed through links!
• A good structural similarity measure in INs: SimRank (KDD’02)


Why SimRank is not Enough?
• Philosophy
  – Two entities are similar if they are referenced by similar entities
• Potential problems
  – Semantically incomplete
    • Only partial structural information, from the in-link direction, is considered during similarity computation
    • Biased similarity results
    • May fail in different IN settings!
  – Inefficient in computation
    • Worst-case O(n^4), can be improved to O(n^3), where n is the number of vertices in the information network


Why SimRank is not Enough?
[Figure: (a) a heterogeneous IN and its structural similarity scores; (b) a homogeneous IN and its structural similarity scores]


P(enetrating)-Rank
• Philosophy: two entities are similar if
  1. they are referenced by similar entities
  2. they reference similar entities
• Advantages
  – Semantically complete
    • Structural information from both the in-link and out-link directions is considered during similarity computation
    • Robust in different IN settings
  – A unified structural similarity framework
    • SimRank is just a special case


P-Rank Formula
• The structural similarity between vertex a and vertex b (a ≠ b), s(a, b):
  – Recursive form
  – Approximate iterative form
• Labeled terms in the formula: in-link similarity, out-link similarity
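
A reconstruction of the two forms in LaTeX, offered as a sketch and not as a copy of the slide. Notation assumed here: I(v) and O(v) are the in-neighbor and out-neighbor sets of vertex v, I_i(v) and O_i(v) are their individual members, C ∈ [0, 1] is the damping factor, and λ ∈ [0, 1] weights the in-link term against the out-link term. A term is taken to be 0 when either neighbor set in it is empty, and s(a, a) = 1. The iterative form uses the same s_k notation as the property list in this talk.

```latex
% Recursive form, for a \neq b (s(a,a) = 1)
s(a,b) =
  \underbrace{\lambda \cdot \frac{C}{|I(a)|\,|I(b)|}
    \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a), I_j(b)\bigr)}_{\text{in-link similarity}}
  +
  \underbrace{(1-\lambda) \cdot \frac{C}{|O(a)|\,|O(b)|}
    \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s\bigl(O_i(a), O_j(b)\bigr)}_{\text{out-link similarity}}

% Approximate iterative form
s_0(a,b) =
  \begin{cases}
    1 & \text{if } a = b \\
    0 & \text{if } a \neq b
  \end{cases}

s_{k+1}(a,b) =
  \lambda \cdot \frac{C}{|I(a)|\,|I(b)|}
    \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s_k\bigl(I_i(a), I_j(b)\bigr)
  + (1-\lambda) \cdot \frac{C}{|O(a)|\,|O(b)|}
    \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s_k\bigl(O_i(a), O_j(b)\bigr),
  \qquad a \neq b
```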


P-Rank Property
• The iterative P-Rank has the following properties:
  – Symmetry: s_k(a, b) = s_k(b, a)
  – Monotonicity: 0 ≤ s_k(a, b) ≤ s_{k+1}(a, b) ≤ 1
  – Existence: the solution to the iterative P-Rank formula always exists and converges to a fixed point, s(∗, ∗), which is the theoretical solution to the recursive P-Rank formula
  – Uniqueness: the solution to the iterative P-Rank formula is unique when C ≠ 1
• The theoretical solution to P-Rank can be reached by repeated computation via the iterative form


P-Rank Derivatives
• P-Rank provides a unified structural similarity framework, of which many structural similarity measures are just special cases
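
As a brief illustration of the special-case claim, here is a sketch in LaTeX. It assumes the recursive P-Rank form with damping factor C, weight λ ∈ [0, 1], and in-neighbor and out-neighbor sets I(v) and O(v); these symbols are assumptions of this sketch, not taken from the slide. Setting λ = 1 removes the out-link term and leaves the SimRank recursion, while λ = 0 leaves only its out-link counterpart.

```latex
% \lambda = 1: only in-links contribute, which is the SimRank recursion
s(a,b) = \frac{C}{|I(a)|\,|I(b)|}
  \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a), I_j(b)\bigr)

% \lambda = 0: only out-links contribute
s(a,b) = \frac{C}{|O(a)|\,|O(b)|}
  \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s\bigl(O_i(a), O_j(b)\bigr)
```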


P-Rank Computation
• An iterative algorithm is executed until it reaches the fixed point
  – Space complexity: O(n^2)
  – Time complexity: O(n^4), can be improved to O(n^3) by amortization
• Approximation algorithms on different IN scenarios
  – Homogeneous IN
    • Radius-based pruning: vertex pairs beyond a radius of r are no longer considered in similarity computation
  – Heterogeneous IN
    • Category-based pruning: vertex pairs in different categories are no longer considered in similarity computation
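
The iterative computation described above can be sketched in Python as follows. This is a minimal, naive version under stated assumptions, not the authors' implementation: the function name `p_rank`, the edge-list input, the default values C = 0.8 and λ = 0.5, and the fixed iteration count are all hypothetical choices for illustration. It stores all n^2 pair scores and applies neither the amortization nor the pruning strategies listed on the slide.

```python
from itertools import product


def p_rank(edges, C=0.8, lam=0.5, iterations=10):
    """Naive iterative P-Rank sketch over directed (source, target) edges.

    C is the damping factor and lam weights in-link against out-link
    similarity (lam=1.0 uses in-links only, as SimRank does).
    Returns a dict mapping every ordered vertex pair (a, b) to a score.
    """
    edge_list = sorted(set(edges))  # drop duplicate edges, fix ordering
    vertices = sorted({v for edge in edge_list for v in edge})
    in_nbrs = {v: [] for v in vertices}
    out_nbrs = {v: [] for v in vertices}
    for src, dst in edge_list:
        out_nbrs[src].append(dst)
        in_nbrs[dst].append(src)

    # s_0(a, b) = 1 if a == b, else 0
    sim = {(a, b): 1.0 if a == b else 0.0
           for a, b in product(vertices, repeat=2)}

    def avg_pair_sim(xs, ys, prev):
        # Average previous-round similarity over all neighbor pairs;
        # defined as 0 when either neighbor set is empty.
        if not xs or not ys:
            return 0.0
        return sum(prev[(x, y)] for x in xs for y in ys) / (len(xs) * len(ys))

    for _ in range(iterations):
        nxt = {}
        for a, b in product(vertices, repeat=2):
            if a == b:
                nxt[(a, b)] = 1.0
                continue
            in_term = avg_pair_sim(in_nbrs[a], in_nbrs[b], sim)
            out_term = avg_pair_sim(out_nbrs[a], out_nbrs[b], sim)
            nxt[(a, b)] = C * (lam * in_term + (1.0 - lam) * out_term)
        sim = nxt
    return sim


# Toy graph (hypothetical): u references a and b; a references x; b references y.
toy_edges = [("u", "a"), ("u", "b"), ("a", "x"), ("b", "y")]
toy_scores = p_rank(toy_edges, C=0.8, lam=0.5, iterations=20)
print(toy_scores[("a", "b")], toy_scores[("x", "y")])
```

In the toy graph, a and b share an in-neighbor but point to different vertices, so lowering `lam` from 1.0 makes the out-link evidence pull their score down relative to the in-link-only setting.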


Experimental Studies
• Data sets
  – Heterogeneous IN: DBLP (paper, author, conference, year)
  – Homogeneous IN: DBLP (papers with citations), synthetic R-MAT data
• Methods
  – P-Rank
  – SimRank
• Metrics
  – Compactness of clusters
  – Algorithmic nature
  – Ground truth


Compactness of Clusters
• P-Rank and SimRank are each used as the underlying similarity measure, and K-Medoids is used to cluster the vertices
  – Compactness: intra-cluster distance / inter-cluster distance
[Figures: compactness results on the heterogeneous IN and on the homogeneous IN]
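
One way to read the compactness ratio is sketched below in Python. The slide does not say how distances are derived or aggregated, so this sketch assumes distance = 1 − similarity and uses the mean over vertex pairs for both the intra-cluster and inter-cluster parts; the function name `compactness` and its input layout are hypothetical. Under this reading, a smaller value indicates tighter, better-separated clusters.

```python
from itertools import combinations


def compactness(sim, clusters):
    """Mean intra-cluster distance divided by mean inter-cluster distance.

    sim maps vertex pairs (a, b) to a similarity in [0, 1]; either
    ordering of a pair may be present. clusters maps each vertex to a
    cluster label. Distance is assumed to be 1 - similarity.
    """
    def distance(a, b):
        score = sim[(a, b)] if (a, b) in sim else sim[(b, a)]
        return 1.0 - score

    intra, inter = [], []
    for a, b in combinations(sorted(clusters), 2):
        bucket = intra if clusters[a] == clusters[b] else inter
        bucket.append(distance(a, b))

    if not intra or not inter:
        raise ValueError("need at least one intra-cluster and one inter-cluster pair")
    mean_inter = sum(inter) / len(inter)
    if mean_inter == 0.0:
        raise ValueError("mean inter-cluster distance is zero")
    return (sum(intra) / len(intra)) / mean_inter


# Hypothetical similarity scores and cluster assignment for four vertices.
example_sim = {
    ("a", "b"): 0.8, ("c", "d"): 0.6,
    ("a", "c"): 0.1, ("a", "d"): 0.1, ("b", "c"): 0.1, ("b", "d"): 0.1,
}
example_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
print(compactness(example_sim, example_clusters))
```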


Algorithmic Nature
• Iterative P-Rank converges quickly to the fixed point
[Figures: P-Rank vs. the damping factor C; P-Rank vs. λ]


Ground Truth Ranking Result

• Top-10 ranking results for author vertices in DBLP by P-Rank


Conclusion
• The proliferation of information networks calls for effective structural similarity measures in
  – Ranking
  – Clustering
  – Top-k query processing
  – ……
• Compared with SimRank, P-Rank is shown to be a more effective structural similarity measure in large information networks
  – Semantically complete, general, robust, and flexible enough to be employed in different IN settings


Thank you

CIKM’ 09 November 3rd, 2009, Hong Kong