hits + pagerank



Page 1: HITS + Pagerank

HITS + PageRank

Jens Noschinski, Thomas Honné, Kersten Schuster, Andreas Schäfer

The slides are licensed under the Creative Commons Attribution-ShareAlike 3.0 License

WS 2010/2011

Web Technologies – Prof. Dr. Ulrik Schroeder

Page 2: HITS + Pagerank

Overview
- Motivation
- HITS
  - background
  - algorithm
  - drawbacks
- PageRank
  - background
  - algorithm
  - problems
- Summary
  - HITS, PageRank
  - differences
- Sources

Page 3: HITS + Pagerank

Motivation
- Problem: searching for information on the web
  - a query returns more than 1 million results, but only the first 10 to 20 results are relevant
- How do search engines decide which sites are important?
- What else needs to be considered?

Page 4: HITS + Pagerank

Motivation
- Fast and efficient
  - many requests at the same time
  - very large set of websites (more than 1,000,000,000,000 in July 2008)
- Up-to-date results
  - recent changes
- Availability
  - of the search engine itself
  - of indexed pages that can be searched (cache)
- Resistance against manipulation
  - search result manipulation
  - spam

Page 5: HITS + Pagerank

HITS

Page 6: HITS + Pagerank

HITS
- HITS = Hyperlink-Induced Topic Search
- Introduced in 1997 by Jon Kleinberg
- For broad-topic information discovery
  - pick out a few relevant sources
- Identify authoritative web pages
  - most central regarding a certain topic
- Question: When can a page be considered authoritative?

Page 7: HITS + Pagerank

Hubs and Authorities
- Two distinct types of pages
- Authorities
  - highly referenced pages
  - considered as authoritative
- Hubs
  - pages that point to many authorities
  - points from which authority is conferred
- Mutually reinforcing relationship
  - a good hub points to many good authorities
  - a good authority is pointed to by many good hubs

[Figure: Hub, Authority]

Page 8: HITS + Pagerank

Root Set and Base Set
- First step of HITS' processing
- Assemble root set S of pages
  - execute a user-supplied query
  - use a full-text search engine
- Expand to base set T
  - add pages that point to any page in S
  - add pages that are pointed to by any page in S
- Restrictions
  - the set of pages pointing to an authority can be enormous
    - consider a fixed-size random subset
  - page links can be internal links for site navigation
    - exclude links between pages on the same host
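The expansion from root set S to base set T can be sketched as follows. This is a minimal illustration, not the reference procedure: the function names, the `out_links` / `in_links` dictionaries, the limit `max_in_links` and the fixed random seed are all assumptions made for the example.

```python
import random
from urllib.parse import urlparse


def build_base_set(root_set, out_links, in_links, max_in_links=50, seed=0):
    """Expand a root set S into a base set T (illustrative sketch).

    out_links maps a URL to the URLs it points to,
    in_links maps a URL to the URLs that point to it (both hypothetical inputs).
    """
    rng = random.Random(seed)
    base = set(root_set)
    for page in root_set:
        # Add every page that the root page points to.
        base.update(out_links.get(page, ()))
        # Add pages pointing to the root page, but at most a
        # fixed-size random subset of them.
        pointing = sorted(in_links.get(page, ()))
        if len(pointing) > max_in_links:
            pointing = rng.sample(pointing, max_in_links)
        base.update(pointing)
    return base


def drop_same_host_links(links):
    """Keep only links (p, q) whose endpoints are on different hosts."""
    return {
        (p, q)
        for p, q in links
        if urlparse(p).hostname != urlparse(q).hostname
    }
```

The host comparison uses the full hostname; whether subdomains of one site should count as the same host is a design choice this sketch leaves open.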

Page 9: HITS + Pagerank

Root Set (S) and Base Set (T)

[Figure: root set S and base set T]

Page 10: HITS + Pagerank

Hub Weight and Authority Weight
- Weights associated with each page p
  - hub weight h(p)
  - authority weight a(p)
  - initialized to 1
- Calculation
  - a(p) is the sum of the hub weights of pages pointing to p
  - h(p) is the sum of the authority weights of pages pointed to by p
- "p → q" means that page p has a hyperlink to page q

$$a(p) := \sum_{q \to p} h(q)$$

$$h(p) := \sum_{p \to q} a(q)$$

Page 11: HITS + Pagerank

Further Processing
- Repeat the whole update operation k times
  - ongoing updates: no exact final result for the weights
  - the weights converge to fixed values as k grows
  - k = 20 has been shown to deliver good convergence
- Normalize the weights
  - prevents the values from getting too large
  - normalize after each iteration

$$\sum_{i=1}^{n} a(i) = 1 \qquad \sum_{i=1}^{n} h(i) = 1$$
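The update and normalization steps can be put together in a short sketch. It assumes the link structure is given as (p, q) pairs and that the hub update uses the authority weights just computed in the same round; the function name `hits` and the toy page names are made up for illustration.

```python
def hits(links, k=20):
    """Run k rounds of hub/authority updates (illustrative sketch).

    links is an iterable of (p, q) pairs meaning "page p links to page q".
    """
    links = list(links)
    pages = sorted({page for edge in links for page in edge})
    auth = {p: 1.0 for p in pages}  # a(p), initialized to 1
    hub = {p: 1.0 for p in pages}   # h(p), initialized to 1
    for _ in range(k):
        # a(p): sum of the hub weights of the pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for p, q in links:
            new_auth[q] += hub[p]
        # h(p): sum of the authority weights of the pages p points to
        # (assumption: the freshly updated authority weights are used).
        new_hub = {p: 0.0 for p in pages}
        for p, q in links:
            new_hub[p] += new_auth[q]
        # Normalize after each iteration so that each set of weights sums to 1.
        a_total = sum(new_auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        auth = {p: w / a_total for p, w in new_auth.items()}
        hub = {p: w / h_total for p, w in new_hub.items()}
    return auth, hub


# Toy graph: three hub-like pages point to a1, only h1 also points to a2.
toy_links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
toy_auth, toy_hub = hits(toy_links)
```

In this toy graph `a1` ends up with the highest authority weight and `h1` with the highest hub weight, which mirrors the mutually reinforcing relationship between hubs and authorities.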

Page 12: HITS + Pagerank

Output
- Only a few pages from the base set are relevant
  - return the n pages with the highest authority weights
  - return the n pages with the highest hub weights
  - n = 10 is reasonable
- These pages are the final search results
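Selecting the n best pages from the computed weights could look like the sketch below. The helper name `top_n` and the example weights are invented for illustration.

```python
import heapq


def top_n(weights, n=10):
    """Return the n pages with the highest weight, best first."""
    return heapq.nlargest(n, weights, key=weights.get)


# Made-up authority weights for four pages.
example_authority = {"p1": 0.10, "p2": 0.55, "p3": 0.30, "p4": 0.05}
best_authorities = top_n(example_authority, n=2)
print(best_authorities)  # prints ['p2', 'p3']
```

The same helper can be applied to the hub weights to obtain the best hubs.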

Page 13: HITS + Pagerank

Drawbacks
- No anti-spam capability
  - link farms can boost hub scores
- Topic drift
  - not all linked pages are thematically related
- Minor link changes can cause large result changes
- Query-dependent
  - the algorithm is executed for every single search query
  - answering a query is time-consuming
    - computation of root and base set
    - calculation of hub and authority weights

Page 14: HITS + Pagerank

PAGERANK

Page 15: HITS + Pagerank

Background on PageRank
- Published in 1998
  - developed and patented at Stanford University, among others by the Google founders Larry Page and Sergey Brin
  - exclusively licensed to Google
- Differences to other search technologies
  - pages are not ranked by content alone
  - new ranking criterion based on the link structure
  - harder to manipulate

Page 16: HITS + Pagerank

Main idea
- Each website has a numeric value called PageRank or Prestige
- PageRank computation is based on in- and outlinks

[Figure: example link graph with pages A, B, C, D]

Page 17: HITS + Pagerank

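PageRank Algorithm
- A surfer follows an outlink of page x with probability p_x

$$p_x = \frac{1}{\text{outdegree}_x}$$

- Therefore the PageRank of a page is

$$PR(x) = \sum_{i \in \text{inlinks}(x)} PR(i) \cdot p_i$$

- Resulting equation system:

$$\begin{aligned}
a &= \tfrac{1}{2} d \\
b &= \tfrac{1}{2} a \\
c &= \tfrac{1}{2} b + \tfrac{1}{2} d \\
d &= \tfrac{1}{2} a + \tfrac{1}{2} b + c
\end{aligned}$$

[Figure: example link graph with pages A, B, C, D]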

Page 18: HITS + Pagerank

PageRank Algorithm
- Other solutions are obtained by multiplying all values by the same factor

$$PR(x) = \sum_{i} \frac{PR(i)}{\text{outdegree}_i}$$

[Figure: example link graph with scores A = 4, B = 2, C = 5, D = 8]
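That the scores A = 4, B = 2, C = 5, D = 8 solve the equations, and that any common factor gives another solution, can be checked with exact arithmetic. The link structure below (A→B, A→D, B→C, B→D, C→D, D→A, D→C) is an assumption: it is the graph that is consistent with the scores shown, not something stated explicitly in the text. The function name is hypothetical.

```python
from fractions import Fraction

# Assumed example graph: page -> pages it links to.
example_out_links = {
    "A": ["B", "D"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": ["A", "C"],
}


def satisfies_pagerank_equations(scores, out_links=example_out_links):
    """Check PR(x) = sum over inlinks i of PR(i) / outdegree(i) for every x."""
    for x in out_links:
        total = sum(
            Fraction(scores[i]) / len(targets)
            for i, targets in out_links.items()
            if x in targets
        )
        if total != scores[x]:
            return False
    return True


slide_scores = {"A": 4, "B": 2, "C": 5, "D": 8}
print(satisfies_pagerank_equations(slide_scores))  # prints True

# Multiplying all values by the same factor yields another solution.
scaled_scores = {page: 3 * value for page, value in slide_scores.items()}
print(satisfies_pagerank_equations(scaled_scores))  # prints True
```

`Fraction` is used so that the comparison is exact and not affected by floating-point rounding.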

Page 19: HITS + Pagerank

Problems of the algorithm
- Rank sink
  - after some iterations A and B will have a PageRank of 0
  - solution: RandomSurfer

[Figure: link graph with pages A, B, C, D]
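A rank sink can be reproduced with a small experiment. The graph used here (A→B, B→C, C→D, D→C) is a hypothetical stand-in, since the graph from the figure is not recoverable from the text; it is chosen so that C and D only link to each other. The function name `simple_pagerank` is also made up.

```python
def simple_pagerank(out_links, iterations):
    """Iterate PR(x) = sum of PR(i) / outdegree(i) without any damping."""
    pages = sorted(out_links)
    rank = {x: 1.0 / len(pages) for x in pages}
    for _ in range(iterations):
        new_rank = {x: 0.0 for x in pages}
        for i, targets in out_links.items():
            for x in targets:
                # Page i passes an equal share of its rank to each outlink.
                new_rank[x] += rank[i] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph in which C and D form a sink.
sink_graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
sink_ranks = simple_pagerank(sink_graph, iterations=3)
print(sink_ranks)  # prints {'A': 0.0, 'B': 0.0, 'C': 0.5, 'D': 0.5}
```

All rank ends up circulating between C and D, while A and B drop to 0 because nothing links back to them.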

Page 20: HITS + Pagerank

RandomSurfer
- Idea: simulate real surfing behavior
  - a real surfer may "teleport" to another website (back button, bookmark, ...)
  - the "damping factor" d is the probability of following a regular outlink

$$PR(x) = (1 - d) + d \sum_{i} \frac{PR(i)}{\text{outdegree}_i}$$

Page 21: HITS + Pagerank

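Iterative algorithm

PageRank-Iterate(G)
    $P_0 \leftarrow e / n$;
    $k \leftarrow 0$;
    Repeat
        $k \leftarrow k + 1$;
        $P_k \leftarrow (1 - d)\,e + d\,M^T P_{k-1}$;
    Until $\lVert P_k - P_{k-1} \rVert_1 < \varepsilon$
    Return $P_k$;

Here e is the vector of all ones and n is the number of pages.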
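The iteration can be written out in a few lines of Python. This is a sketch under stated assumptions: the graph is passed as a dictionary of outlinks, pages without outlinks are simply skipped, and the `normalized` flag (not part of the pseudocode) switches the teleport term from (1 − d) to (1 − d)/n so that the ranks sum to 1. The example graph is the assumed link structure A→B, A→D, B→C, B→D, C→D, D→A, D→C.

```python
def pagerank_iterate(out_links, d=0.85, eps=1e-8, normalized=False):
    """Iterate P_k = teleport + d * M^T P_(k-1) until the L1 change is below eps.

    out_links maps each page to the list of pages it links to.
    normalized=True uses (1 - d) / n as teleport term (an added option).
    """
    pages = sorted(out_links)
    n = len(pages)
    teleport = (1.0 - d) / n if normalized else (1.0 - d)
    prev = {x: 1.0 / n for x in pages}  # P_0 = e / n
    while True:
        cur = {x: teleport for x in pages}
        for i, targets in out_links.items():
            if not targets:
                continue  # simplification: pages without outlinks pass on nothing
            share = d * prev[i] / len(targets)
            for x in targets:
                cur[x] += share
        # Stop when the L1 norm of the change falls below eps.
        if sum(abs(cur[x] - prev[x]) for x in pages) < eps:
            return cur
        prev = cur


# Assumed example graph.
example_graph = {
    "A": ["B", "D"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": ["A", "C"],
}
ranks_sum_to_n = pagerank_iterate(example_graph)
ranks_sum_to_1 = pagerank_iterate(example_graph, normalized=True)
```

With `normalized=True` the result for this graph is roughly 0.209, 0.126, 0.262 and 0.403 for A, B, C and D; without it, every value is four times as large.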

Page 22: HITS + Pagerank

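Iterative algorithm

Step 0:

$$d = 0.85, \qquad \varepsilon = 0.00000001$$

$$M = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad P_0 = \begin{pmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{pmatrix}$$

Rows and columns are in the order A, B, C, D; an entry of 1 in row i, column j means that page i links to page j.

[Figure: example link graph with pages A, B, C, D]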

Page 23: HITS + Pagerank

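Iterative algorithm

Step 1:

$$P_1 = \begin{pmatrix} 0.0375 \\ 0.0375 \\ 0.0375 \\ 0.0375 \end{pmatrix} + 0.85 \cdot M^T \begin{pmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{pmatrix} = \begin{pmatrix} 0.14375 \\ 0.14375 \\ 0.25 \\ 0.4625 \end{pmatrix}$$

The entries are in the order A, B, C, D. Note: the result shown uses a teleport term of (1 − d)/n = 0.0375 per page, not (1 − d) = 0.15, and divides each row of M by the page's outdegree; only then do the entries of P_1 sum to 1.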

Page 24: HITS + Pagerank

Iterative algorithm

Final step:

$$P_{60} = \begin{pmatrix} 0.20868892 \\ 0.12619279 \\ 0.26232085 \\ 0.40279744 \end{pmatrix}$$

$$20 \cdot P_{60} \approx \begin{pmatrix} 4.17 \\ 2.52 \\ 5.24 \\ 8.05 \end{pmatrix} \approx \begin{pmatrix} 4 \\ 2 \\ 5 \\ 8 \end{pmatrix}$$

The entries are in the order A, B, C, D.

Page 25: HITS + Pagerank

Properties
- Strengths
  - pre-computable
  - fast
  - spam-resistant
    - minor changes have minor effects
- Weaknesses
  - pages are only authoritative in general and not on the query topic
  - link farms
  - Google bombs

Page 26: HITS + Pagerank

Summary
- HITS
  - the algorithm is executed after a query is made
  - pages get a hub value and an authority value
  - calculates whether a page provides good information and/or whether it links to pages that do so
  - no spam-fighting ability
- PageRank
  - each page gets one PageRank that declares its value
  - query-independent
  - spam-resistant

Page 27: HITS + Pagerank

Sources
- Papers about PageRank
  - Larry Page et al.: "The PageRank Citation Ranking: Bringing Order to the Web"
  - Ulrik Brandes, Gabi Dorfmüller: 10. Algorithmus der Woche der RWTH Aachen, 09/05/2006
  - Peter J. Zehetner: "Der PageRank-Algorithmus", 05/2007
  - Taher H. Haveliwala: "Efficient Computation of PageRank", 10/1999
- Papers about HITS
  - Jon Kleinberg: "Authoritative Sources in a Hyperlinked Environment"
  - Jon Kleinberg et al.: "Inferring Web Communities from Link Topology"
- Book
  - Bing Liu: "Web Data Mining", 2008