1 Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure Allen, Zhenjiang LIN CSE, CUHK 13 Dec 2006

Extending Link-based Algorithms for Similar Web Pages

with Neighborhood Structure

Allen, Zhenjiang LIN CSE, CUHK

13 Dec 2006


Outline

1. Introduction

2. Extended Neighborhood Structure Model

3. Extending Link-based Similarity Measures

4. Experimental Results

5. Conclusion and Future Work


1. Introduction

Background
Similarity measures are required in many web applications to evaluate the similarity between web pages:
- the "similar pages" service of Web search engines;
- Web document classification;
- Web community identification.

Problem
Many link-based similarity measures are not very accurate because they consider only part of the structural information.


1. Introduction

Motivation
How can the accuracy of link-based similarity measures be improved by making full use of the structural information?

Contributions
- Propose the Extended Neighborhood Structure (ENS) model, which is bi-directional and multi-hop.
- Construct extended link-based similarity measures based on the ENS model, which are more flexible and accurate.


1. Introduction

Searching the Web
- Keyword searching: the query "KEYWORDS: news" is sent to the search engine, which returns http://news.bbc.co.uk/, http://www.cnn.com/, …
- Similarity searching: the query "URL: www.cnn.com" is sent to the search engine, which uses a similarity measure to return http://news.bbc.co.uk/, http://usnews.com/, …


1. Introduction

Similarity measures
Evaluate how similar or related two objects are.

Approaches to measuring similarity
- Text-based: Cosine TFIDF [Joachims97]
- Link-based (focus of this talk): Bibliographic coupling [Kessler63], Co-citation [Small73], SimRank [Jeh et al 02], PageSim [Lin et al 06]
- Hybrid


2. Extended Neighborhood Structure Model

Extended Neighborhood Structure (ENS) model

Question: what is hidden in hyperlinks?
- A similarity relationship between pages.
- The similarity relationship decreases along hyperlinks.


2. Extended Neighborhood Structure Model

Extended Neighborhood Structure (ENS) model
The ENS model is
- bi-directional: in-link and out-link;
- multi-hop: direct (1-hop) and indirect (2-hop, 3-hop, etc.).

Purpose
Improve the accuracy of link-based similarity measures by helping them make full use of the structural information of the Web.


3. Extending Link-based Similarity Measures

Intuition of similarity
Similar web pages have similar neighbors (to compare two web pages, look at their neighbors).

Notations
- G = (V, E), |V| = n: the web graph.
- I(a) / O(a): in-link / out-link neighbors of web page a.
- path(a_1, a_s): a sequence of vertices a_1, a_2, …, a_s such that (a_i, a_{i+1}) ∈ E (i = 1, …, s-1) and the a_i are distinct.
- PATH(a, b): the set of all possible paths from page a to page b.
- Sim(a, b): similarity score of web pages a and b.
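To make the notation concrete, here is a minimal Python sketch that builds I(a) and O(a) from an edge list and enumerates PATH(a, b) as simple paths with distinct vertices. The function names `build_neighbors` and `all_paths`, the `max_hops` cut-off, and the toy edge list are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Return (in_nbrs, out_nbrs): I(a) and O(a) as sets, keyed by page."""
    in_nbrs = defaultdict(set)
    out_nbrs = defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)   # dst is an out-link neighbor of src
        in_nbrs[dst].add(src)    # src is an in-link neighbor of dst
    return in_nbrs, out_nbrs


def all_paths(out_nbrs, a, b, max_hops=None):
    """Enumerate PATH(a, b): every path from a to b with distinct vertices.

    max_hops (hypothetical parameter) optionally bounds the path length,
    since the number of simple paths can grow very quickly.
    """
    paths = []

    def dfs(path):
        node = path[-1]
        if node == b:
            paths.append(list(path))
            return
        if max_hops is not None and len(path) - 1 >= max_hops:
            return
        for nxt in sorted(out_nbrs.get(node, ())):
            if nxt not in path:          # keep the vertices distinct
                path.append(nxt)
                dfs(path)
                path.pop()

    dfs([a])
    return paths


# Toy web graph (assumed for illustration): a->b, a->c, c->b, b->d
in_nbrs, out_nbrs = build_neighbors(
    [("a", "b"), ("a", "c"), ("c", "b"), ("b", "d")]
)
print(sorted(in_nbrs["b"]))            # in-link neighbors I(b)
print(all_paths(out_nbrs, "a", "b"))   # all simple paths from a to b
```

Sets are used for I(a) and O(a) so that the intersections in the measures that follow are cheap to compute.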


3. Extending Link-based Similarity Measures

Two classical methods
- Co-citation: the more common in-link neighbors, the more similar. Sim(a,b) = |I(a)∩I(b)|
- Bibliographic coupling: the more common out-link neighbors, the more similar. Sim(a,b) = |O(a)∩O(b)|

Extended Co-citation and Bibliographic Coupling (ECBC)
- ECBC: the more common neighbors, the more similar. Sim(a,b) = α|I(a)∩I(b)| + (1-α)|O(a)∩O(b)|, where 0≤α≤1 is a constant.
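The three formulas above translate almost directly into set operations. The sketch below assumes I and O are stored as dictionaries of sets; the function names and the toy neighbor sets are illustrative, and the default α = 0.5 is an arbitrary choice, not a value taken from the talk.

```python
def cocitation(in_nbrs, a, b):
    """Co-citation: |I(a) ∩ I(b)|, the number of common in-link neighbors."""
    return len(in_nbrs.get(a, set()) & in_nbrs.get(b, set()))


def bibliographic_coupling(out_nbrs, a, b):
    """Bibliographic coupling: |O(a) ∩ O(b)|, the number of common out-link neighbors."""
    return len(out_nbrs.get(a, set()) & out_nbrs.get(b, set()))


def ecbc(in_nbrs, out_nbrs, a, b, alpha=0.5):
    """ECBC: alpha * |I(a) ∩ I(b)| + (1 - alpha) * |O(a) ∩ O(b)|, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (alpha * cocitation(in_nbrs, a, b)
            + (1.0 - alpha) * bibliographic_coupling(out_nbrs, a, b))


# Toy neighbor sets (assumed for illustration)
in_nbrs = {"a": {"x", "y"}, "b": {"y", "z"}}
out_nbrs = {"a": {"p", "q"}, "b": {"p", "q", "r"}}
print(ecbc(in_nbrs, out_nbrs, "a", "b"))  # 0.5 * 1 + 0.5 * 2 = 1.5
```

Setting α = 1 recovers Co-citation and α = 0 recovers Bibliographic coupling, which is why ECBC can be seen as a bi-directional generalization of both.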


3. Extending Link-based Similarity Measures

SimRank
"Two pages are similar if they are linked to by similar pages."

(1) Sim(u,u) = 1; (2) Sim(u,v) = 0 if |I(u)| |I(v)| = 0.

Recursive definition:

\[ Sim(u,v) = C \cdot \frac{\sum_{a \in I(u)} \sum_{b \in I(v)} Sim(a,b)}{|I(u)|\,|I(v)|} \]

C is a constant between 0 and 1. The iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 if u ≠ v.
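The SimRank recursion can be evaluated by simple fixed-point iteration from the stated starting values. The following is a naive sketch for small graphs only; the function name `simrank`, the value C = 0.8, and the iteration count are assumptions made for illustration, since the talk only states that C lies between 0 and 1.

```python
def simrank(nodes, in_nbrs, C=0.8, iterations=10):
    """Naive SimRank iteration; O(n^2) pairs per round, so small graphs only.

    nodes   : iterable of all pages (every in-link neighbor must be listed)
    in_nbrs : dict mapping a page to the set I(page)
    C       : decay constant in (0, 1); 0.8 is an assumed default
    """
    nodes = list(nodes)
    # Start of the iteration: Sim(u,u) = 1 and Sim(u,v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new_sim = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new_sim[(u, v)] = 1.0
                    continue
                i_u = in_nbrs.get(u, set())
                i_v = in_nbrs.get(v, set())
                if not i_u or not i_v:
                    # Rule (2): no in-links on either side means similarity 0.
                    new_sim[(u, v)] = 0.0
                    continue
                total = sum(sim[(a, b)] for a in i_u for b in i_v)
                new_sim[(u, v)] = C * total / (len(i_u) * len(i_v))
        sim = new_sim
    return sim


# Toy graph (assumed): page x links to both u and v.
scores = simrank(["x", "u", "v"], {"u": {"x"}, "v": {"x"}})
print(scores[("u", "v")])  # u and v share their only in-link neighbor
```

Each round reads only the scores of the previous round, so the update order of the pairs does not affect the result.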


3. Extending Link-based Similarity Measures

Extended SimRank
"Two pages are similar if they have similar neighbors."

(1) Sim(u,u) = 1; (2) Sim(u,v) = 0 if |I(u)| |I(v)| = 0.

Recursive definition:

\[ Sim(u,v) = C \cdot \frac{\sum_{a \in I(u)} \sum_{b \in I(v)} Sim(a,b) + \sum_{a \in O(u)} \sum_{b \in O(v)} Sim(a,b)}{|I(u)|\,|I(v)| + |O(u)|\,|O(v)|} \]

C is a constant between 0 and 1. The iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 if u ≠ v.


3. Extending Link-based Similarity Measures

PageSim
A "weighted multi-hop" version of the Co-citation algorithm. It takes into account
(a) multi-hop in-link information, and
(b) the importance of web pages, which can be represented by any global scoring system, such as PageRank scores or the authority scores of HITS.


3. Extending Link-based Similarity Measures

PageSim (phase 1: feature propagation)
- Initially, each web page contains a unique piece of feature information, which is represented by its PageRank score.
- The feature information of a web page is propagated along out-link hyperlinks at decay rate d. The PR score of u propagated to v is defined by


3. Extending Link-based Similarity Measures

PageSim (phase 2: similarity computation)
- A web page v stores its own feature information and that of other pages in its Feature Vector FV(v).
- The similarity between web pages u and v is computed by the Jaccard measure [Jain et al 88].
- Intuition: the more common feature information two web pages contain, the more similar they are.
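The exact PageSim similarity formula is not reproduced in this transcript. As a rough illustration of the idea of comparing feature vectors with a Jaccard-style measure, the sketch below assumes the common weighted (generalized) Jaccard form, sum of minima over sum of maxima; this form, the function name `weighted_jaccard`, and the toy vectors are assumptions and may differ from the measure actually used by PageSim.

```python
def weighted_jaccard(fv_u, fv_v):
    """Generalized Jaccard similarity of two sparse, non-negative feature vectors.

    fv_u, fv_v: dicts mapping a source page to the amount of its feature
    information stored at u (resp. v). A missing key counts as 0.
    NOTE: sum(min) / sum(max) is an assumed form, not taken from the slides.
    """
    keys = set(fv_u) | set(fv_v)
    shared = sum(min(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    total = sum(max(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    return shared / total if total > 0 else 0.0


# Toy feature vectors (assumed): both pages received feature information from p.
fv_u = {"p": 0.5, "q": 0.25}
fv_v = {"p": 0.25, "r": 0.25}
print(weighted_jaccard(fv_u, fv_v))  # 0.25 / 1.0 = 0.25
```

The score is 1 for identical vectors and 0 when the two pages share no feature information, which matches the stated intuition.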


3. Extending Link-based Similarity Measures

Extended PageSim (EPS)
- Propagate the feature information of web pages along in-link hyperlinks at decay rate 1-d.
- Compute the in-link PS scores.
- EPS(u,v) = in-link PS(u,v) + out-link PS(u,v).


3. Extending Link-based Similarity Measures

Properties

CC: Co-citation, BC: Bibliographic Coupling, ECBC: Extended Co-citation and Bibliographic Coupling, SR: SimRank, ESR: Extended SimRank, PS: PageSim, EPS: Extended PageSim.

Summary
- The extended versions consider more structural information.
- ESR and EPS are bi-directional and multi-hop.
- In ESR, two web pages are not similar unless there are intermediate pages between them, even if they link to each other (see Figure 1(2)).


3. Extending Link-based Similarity Measures

Case study: Sim(a,b)

Summary
- The extended algorithms are more flexible.
- EPS is able to handle more cases.


4. Experimental Results

Datasets
- CSE Web (CW) dataset: a set of web pages crawled from http://cse.cuhk.edu.hk. 22,000 pages, 180,000 hyperlinks. The average numbers of in-links and out-links are 8.6 and 7.7.
- Google Scholar (GS) dataset: a set of articles crawled from the Google Scholar search engine. Crawling started by submitting the keywords "web mining" to GS and then following the "Cited by" hyperlinks. 20,000 articles, 154,000 citations.


4. Experimental Results

Evaluation Methods
- Cosine TFIDF similarity (for the CW dataset): a commonly used text-based similarity measure.
- "Related Articles" (for the GS dataset): a list of articles related to a query article, provided by GS. Can be used as ground truth.
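For readers unfamiliar with the text-based reference measure, here is a minimal pure-Python sketch of cosine similarity over TF-IDF vectors. TF-IDF has several variants; this sketch assumes raw term frequency times log(N / df), which is not necessarily the weighting used in the experiments. The function names and the toy documents are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse dict (term -> weight) per document.

    Assumed weighting: tf * log(N / df); other TF-IDF variants exist.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


# Toy "pages" (assumed), already tokenized
docs = [
    ["web", "mining", "links"],
    ["web", "mining", "graph"],
    ["cooking", "recipes"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # pages sharing terms get a positive score
print(cosine(vecs[0], vecs[2]))  # pages with no common terms score 0.0
```

In the evaluation described above, a score of this kind between a query page and each of its top N results would then be averaged.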

Parameter Settings


4. Experimental Results

CC, BC vs ECBC

CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages.

GS data (right): x-axis: top N results; y-axis: average precision of all pages.
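The y-axis of the GS plots can be read as precision at N averaged over query pages. The sketch below assumes that precision at N means the fraction of the top N returned articles that appear in the GS "Related Articles" list of the query; the talk does not spell out the definition, so this reading and the function names are assumptions.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked results that are in the relevant set."""
    if n <= 0:
        return 0.0
    hits = sum(1 for page in ranked[:n] if page in relevant)
    return hits / n


def mean_precision_at_n(results, ground_truth, n):
    """Average precision_at_n over all query pages.

    results      : dict query -> ranked list returned by a similarity measure
    ground_truth : dict query -> set of related articles (e.g. from GS)
    """
    queries = list(results)
    if not queries:
        return 0.0
    total = sum(
        precision_at_n(results[q], ground_truth.get(q, set()), n)
        for q in queries
    )
    return total / len(queries)


# Toy data (assumed for illustration)
results = {"q1": ["a", "b"], "q2": ["c", "d"]}
ground_truth = {"q1": {"a", "b"}, "q2": {"x"}}
print(mean_precision_at_n(results, ground_truth, 2))  # (1.0 + 0.0) / 2 = 0.5
```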


4. Experimental Results

SimRank vs Extended SimRank

CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages.

GS data (right): x-axis: top N results; y-axis: average precision of all pages.


4. Experimental Results

PageSim vs Extended PageSim

CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages.

GS data (right): x-axis: top N results; y-axis: average precision of all pages.


4. Experimental Results

Overall Accuracy of Algorithms


5. Conclusion and Future Work

Conclusion
- Extended Neighborhood Structure model: bi-directional and multi-hop.
- Extended existing link-based similarity measures: Co-citation, Bibliographic coupling, SimRank, PageSim.
- Experiments.

Future Work
- Extend link-based algorithms based on the ENS model.
- Prove the convergence of the Extended SimRank.
- Integrate link-based with text-based similarity measures.


Publications

Z. Lin, M. R. Lyu, and I. King. PageSim: A novel link-based measure of web page similarity. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh, Scotland, 2006.

Z. Lin, I. King, and M. R. Lyu. PageSim: A novel link-based similarity measure for the World Wide Web. In WI '06: Proceedings of the 5th International Conference on Web Intelligence. ACM Press, 2006. To appear.

Z. Lin, M. R. Lyu, and I. King. Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure. Submitted to WWW '07.