Lecture 8: Linkage algorithms and web search

Information Retrieval
Computer Science Tripos Part II

Ronan Cummins
Natural Language and Information Processing (NLIP) Group
[email protected]
2016

(Adapted from Simone Teufel's original slides)



Overview

1 Recap

2 Anchor text

3 PageRank

4 Wrap up

Summary: clustering and classification

Clustering is unsupervised learning

Partitional clustering

Provides less information but is more efficient (best: O(kn))

K-means

Complexity O(kmni)

Guaranteed to converge; non-optimal; depends on initial seeds

Minimises average squared within-cluster difference

Hierarchical clustering

Best algorithms have O(n²) complexity

Single-link vs. complete-link (vs. group-average)

Hierarchical and non-hierarchical clustering fulfil different needs (e.g. visualisation vs. navigation)
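To make the K-means recap concrete, here is a minimal self-contained sketch of the assignment/update loop; the toy 2-D points, the seeding, and the iteration cap are illustrative assumptions, not part of the lecture.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on 2-D points (illustrative sketch, no refinements)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # seed centres from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centre to the mean of its cluster.
        new = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
               if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:                   # converged: stop iterating
            break
        centers = new
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
```

Each pass of the loop costs O(kmn) (k centres, m dimensions, n points); with i iterations this gives the O(kmni) complexity quoted above.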

Upcoming today

Anchor text: What exactly are links on the web and why are they important for IR?

PageRank: the original algorithm that was used for link-based ranking on the web

Hubs & Authorities: an alternative link-based ranking algorithm


The web as a directed graph

(diagram: page d1 --[hyperlink with anchor text]--> page d2)

Assumption 1: A hyperlink is a quality signal.

The hyperlink d1 → d2 indicates that d1's author deems d2 high-quality and relevant.

Assumption 2: The anchor text describes the content of d2.

We use "anchor text" somewhat loosely here for: the text surrounding the hyperlink.

Example: "You can find cheap cars <a href=http://...>here</a>."

Anchor text: "You can find cheap cars here"
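The extraction of anchor text from the example above can be sketched with Python's standard-library html.parser; the class name and the toy snippet are illustrative assumptions (real systems would also capture the text surrounding the link, per the loose definition above).

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are currently inside, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorExtractor()
p.feed('You can find cheap cars <a href="http://cars.example">here</a>.')
print(p.links)   # [('http://cars.example', 'here')]
```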

[text of d2] only vs. [text of d2] + [anchor text → d2]

Searching on [text of d2] + [anchor text → d2] is often more effective than searching on [text of d2] only.

Example: Query IBM

Matches IBM's copyright page

Matches many spam pages

Matches the IBM Wikipedia article

May not match the IBM home page!

. . . if the IBM home page is mostly graphics

Searching on [anchor text → d2] is better for the query IBM.

In this representation, the page with the most occurrences of IBM is www.ibm.com.

Anchor text containing IBM pointing to www.ibm.com

www.nytimes.com: "IBM acquires Webify"

www.slashdot.org: "New IBM optical chip"

www.stanford.edu: "IBM faculty award recipients"

→ www.ibm.com

Thus: Anchor text is often a better description of a page's content than the page itself.

Anchor text can be weighted more highly than document text (based on Assumptions 1 & 2).
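One way to realise that weighting is a fielded index: keep the page's own text and its aggregated incoming anchor text as separate fields, and score anchor-field occurrences more highly. A minimal sketch, where the field names, the weight 3.0, and the toy counts are all illustrative assumptions rather than a prescribed scheme:

```python
# Toy fielded representation: term frequencies in the page body and in
# the anchor text of incoming links (counts are made up for illustration).
doc = {
    "body":   {"copyright": 4, "ibm": 1},
    "anchor": {"ibm": 9, "optical": 1},   # aggregated over all in-links
}

def weighted_tf(doc, term, anchor_weight=3.0):
    """Combined term frequency, boosting the anchor-text field."""
    return doc["body"].get(term, 0) + anchor_weight * doc["anchor"].get(term, 0)

print(weighted_tf(doc, "ibm"))   # 1 + 3.0 * 9 = 28.0
```

With such a boost, a page that many others describe as "IBM" outranks a page that merely mentions the term often in its own body text.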

Google bombs

A Google bomb is a search with "bad" results due to maliciously manipulated anchor text.

Google introduced a new weighting function in 2007 that fixed many Google bombs.

Still some remnants: [dangerous cult] on Google, Bing, Yahoo

Coordinated link creation by those who dislike the Church of Scientology

Defused Google bombs: [dumb motherf....], [who is a failure?], [evil empire]

A historic google bomb

Origins of PageRank: Citation Analysis

We can use the same formal representation (as a directed graph) for

citations in the scientific literature

hyperlinks on the web

Appropriately weighted citation frequency is an excellent measure of quality . . .

. . . both for web pages and for scientific publications.

Next: the PageRank algorithm for computing weighted citation frequency on the web


Model behind PageRank: Random walk

Imagine a web surfer doing a random walk on the web

Start at a random page

At each step, go out of the current page along one of the links on that page, equiprobably

In the steady state, each page has a long-term visit rate.

This long-term visit rate is the page's PageRank.

PageRank = long-term visit rate = steady-state probability
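The model can be checked empirically by just running the walk. A hedged sketch, simulating a long random walk on a made-up four-page graph with no dead ends (the pages, links, and step count are illustrative assumptions):

```python
import random
from collections import Counter

# A made-up 4-page web graph with no dead ends (illustrative only).
out_links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [2]}

def simulate(steps=200_000, seed=0):
    """Estimate long-term visit rates by running one long random walk."""
    rng = random.Random(seed)
    page = rng.choice(list(out_links))       # start at a random page
    visits = Counter()
    for _ in range(steps):
        page = rng.choice(out_links[page])   # follow a random out-link
        visits[page] += 1
    return {p: visits[p] / steps for p in out_links}

rates = simulate()
print(rates)   # visit rates sum to 1; page 2, with the most in-links, dominates
```

The empirical visit rates settle down as the walk gets longer; the Markov-chain formalisation that follows says exactly when and to what they converge.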

Formalisation of random walk: Markov chains

A Markov chain consists of N states, plus an N × N transition probability matrix P.

state = page

At each step, we are on exactly one of the pages.

For 1 ≤ i, j ≤ N, the matrix entry P_ij tells us the probability of j being the next page, given we are currently on page i.

Clearly, for all i: ∑_{j=1}^{N} P_ij = 1

(diagram: states d_i → d_j, with the edge labelled P_ij)

Example web graph

(figure: a seven-page web graph over d0–d6; its edges are listed in the link matrix on the next slide)

Link matrix for example

     d0  d1  d2  d3  d4  d5  d6
d0    0   0   1   0   0   0   0
d1    0   1   1   0   0   0   0
d2    1   0   1   1   0   0   0
d3    0   0   0   1   1   0   0
d4    0   0   0   0   0   0   1
d5    0   0   0   0   0   1   1
d6    0   0   0   1   1   0   1

Transition probability matrix P for example

      d0    d1    d2    d3    d4    d5    d6
d0  0.00  0.00  1.00  0.00  0.00  0.00  0.00
d1  0.00  0.50  0.50  0.00  0.00  0.00  0.00
d2  0.33  0.00  0.33  0.33  0.00  0.00  0.00
d3  0.00  0.00  0.00  0.50  0.50  0.00  0.00
d4  0.00  0.00  0.00  0.00  0.00  0.00  1.00
d5  0.00  0.00  0.00  0.00  0.00  0.50  0.50
d6  0.00  0.00  0.00  0.33  0.33  0.00  0.33
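P here is simply the link matrix with each row normalised to sum to 1 (each out-link taken equiprobably). A quick sketch reproducing it from the link matrix of the previous slide:

```python
# Row-normalise the link matrix to obtain the transition probability
# matrix P: divide each row by its number of out-links.
A = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
]

P = [[x / sum(row) for x in row] for row in A]

print([round(x, 2) for x in P[2]])   # [0.33, 0.0, 0.33, 0.33, 0.0, 0.0, 0.0]
```

Every row of P sums to 1, as the Markov-chain formalisation requires.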

Long-term visit rate

Recall: PageRank = long-term visit rate

The long-term visit rate of page d is the probability that the web surfer is at page d at a given point in time.

Next: what properties must hold of the web graph for the long-term visit rate to be well defined?

The web graph must correspond to an ergodic Markov chain.

First a special case: the web graph must not contain dead ends.

Dead ends

A dead end is a page with no outgoing links: the walk has nowhere to go next.

The web is full of dead ends.

A random walk can get stuck in dead ends.

If there are dead ends, long-term visit rates are not well defined (or nonsensical).

Teleporting – to get us out of dead ends

At a dead end, jump to a random web page with probability 1/N.

At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N).

With the remaining probability (90%), follow a random hyperlink on the page.

For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225.

10% is a parameter, the teleportation rate.

Note: “jumping” from a dead end is independent of the teleportation rate.

341

Teleporting – formula

P′ = (1 − α) · P + α · T    (1)

where P is a stochastic (transition probability) matrix and T is the teleportation matrix.

What is T? An N × N matrix in which every entry is 1/N.

α is the probability of teleporting.

342
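Equation (1) can be sketched in a few lines of NumPy. This is not part of the original slides; the 2 × 2 matrix P below is a hypothetical example chosen only to illustrate the construction.

```python
import numpy as np

alpha = 0.1                       # teleportation rate (the 10% above)
P = np.array([[0.0, 1.0],         # hypothetical stochastic matrix:
              [0.5, 0.5]])        # each row sums to 1
N = P.shape[0]
T = np.full((N, N), 1.0 / N)      # teleportation matrix: all entries 1/N

P_prime = (1 - alpha) * P + alpha * T   # equation (1)

# The result is still stochastic: every row of P' sums to 1.
assert np.allclose(P_prime.sum(axis=1), 1.0)
```

Because both P and T are stochastic and the weights (1 − α) and α sum to 1, P′ is stochastic as well.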

Result of teleporting

With teleporting, we cannot get stuck in a dead end.

But even without dead ends, a graph may not have well-defined long-term visit rates.

More generally, we require that the Markov chain be ergodic.

343

Ergodic Markov chains

A Markov chain is ergodic iff it is irreducible and aperiodic.

Irreducibility. Roughly: there is a path from any page to any other page.

Aperiodicity. Roughly: the pages cannot be partitioned such that the random walker visits the partitions sequentially.

A non-ergodic Markov chain: (figure: two states whose only transitions are to each other, each with probability 1.0 – the walk alternates between them forever, i.e. it is periodic)

344

Ergodic Markov chains

Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state.

This is the steady-state probability distribution.

Over a long time period, we visit each state in proportion to this rate.

It doesn’t matter where we start.

Teleporting makes the web graph ergodic.

⇒ Web graph + teleporting has a steady-state probability distribution.

⇒ Each page in the web graph + teleporting has a PageRank.

345

Where we are

We now know what to do to make sure we have a well-defined PageRank for each page.

Next: how to compute PageRank

346

Formalization of “visit”: Probability vector

A probability (row) vector ~x = (x1, . . . , xN) tells us where the random walk is at any point.

Example (the walk is on page i with certainty):
( 0    0    0    . . .  1  . . .  0    0    0  )
  1    2    3    . . .  i  . . .  N−2  N−1  N

More generally: the random walk is on page i with probability xi.

Example:
( 0.05 0.01 0.0  . . . 0.2 . . . 0.01 0.05 0.03 )
  1    2    3    . . .  i  . . .  N−2  N−1  N

∑ xi = 1

347

Change in probability vector

If the probability vector is ~x = (x1, . . . , xN) at this step, what is it at the next step?

Recall that row i of the transition probability matrix P tells us where we go next from state i.

So from ~x, our next state is distributed as ~xP.

348
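The one-step update ~x → ~xP is a single row-vector–matrix product. A minimal sketch (not from the slides), using the 2 × 2 transition matrix of the two-page example with P11 = P21 = 0.25 and P12 = P22 = 0.75:

```python
import numpy as np

P = np.array([[0.25, 0.75],   # row 1: where we go next from page 1
              [0.25, 0.75]])  # row 2: where we go next from page 2

x = np.array([1.0, 0.0])      # walker is on page 1 with certainty
x_next = x @ P                # distribution after one step
```

Here `x_next` equals row 1 of P, since all the probability mass starts on page 1.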

Steady state in vector notation

The steady state in vector notation is simply a vector ~π = (π1, π2, . . . , πN) of probabilities.

(We use ~π to distinguish it from the notation for the probability vector ~x.)

πi is the long-term visit rate (or PageRank) of page i.

So we can think of PageRank as a very long vector – one entry per page.

349

Steady-state distribution: Example

What is the PageRank / steady state in this example? Two pages d1 and d2: from d1, go to d1 with probability 0.25 and to d2 with probability 0.75; from d2, go to d1 with probability 0.25 and to d2 with probability 0.75.

350

Steady-state distribution: Example

P11 = 0.25   P12 = 0.75
P21 = 0.25   P22 = 0.75

     Pt(d1)  Pt(d2)
t0   0.25    0.75
t1   0.25    0.75    (convergence)

Pt(d1) = Pt−1(d1) · P11 + Pt−1(d2) · P21 = 0.25 · 0.25 + 0.75 · 0.25 = 0.25
Pt(d2) = Pt−1(d1) · P12 + Pt−1(d2) · P22 = 0.25 · 0.75 + 0.75 · 0.75 = 0.75

PageRank vector = ~π = (π1, π2) = (0.25, 0.75)

351
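The claimed steady state can be checked mechanically: ~π = (0.25, 0.75) should be unchanged by one more application of P. A quick sketch (not from the slides):

```python
import numpy as np

P = np.array([[0.25, 0.75],
              [0.25, 0.75]])
pi = np.array([0.25, 0.75])

# Steady state: applying the transition matrix changes nothing.
assert np.allclose(pi @ P, pi)
```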

How do we compute the steady state vector?

In other words: how do we compute PageRank?

Recall: ~π = (π1, π2, . . . , πN) is the PageRank vector, the vector of steady-state probabilities . . .

. . . and if the distribution in this step is ~x, then the distribution in the next step is ~xP.

But ~π is the steady state!

So: ~π = ~πP

Solving this matrix equation gives us ~π.

~π is the principal left eigenvector of P . . .

. . . that is, ~π is the left eigenvector with the largest eigenvalue.

All transition probability matrices have largest eigenvalue 1.

352
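The eigenvector view can be made concrete: the left eigenvectors of P are the (right) eigenvectors of Pᵀ, so we can solve ~π = ~πP with an off-the-shelf eigendecomposition. A sketch (not from the slides), again using the two-page example matrix:

```python
import numpy as np

P = np.array([[0.25, 0.75],
              [0.25, 0.75]])

# Left eigenvectors of P = right eigenvectors of P transposed.
vals, vecs = np.linalg.eig(P.T)
i = np.argmax(vals.real)        # principal eigenvalue (= 1 for stochastic P)
pi = vecs[:, i].real
pi = pi / pi.sum()              # rescale so the probabilities sum to 1
```

Note that `eig` returns eigenvectors with arbitrary scale and sign; dividing by the sum fixes both at once.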

One way of computing the PageRank ~π

Start with any distribution ~x, e.g., the uniform distribution.

After one step, we’re at ~xP.

After two steps, we’re at ~xP^2.

After k steps, we’re at ~xP^k.

Algorithm: multiply ~x by increasing powers of P until convergence.

This is called the power method.

Recall: regardless of where we start, we eventually reach the steady state ~π.

Thus: we will eventually (in asymptotia) reach the steady state.

353


Computing PageRank: Power method

P11 = 0.1   P12 = 0.9
P21 = 0.3   P22 = 0.7

      x1       x2       Pt(d1)   Pt(d2)
t0    0        1        0.3      0.7      = ~xP
t1    0.3      0.7      0.24     0.76     = ~xP^2
t2    0.24     0.76     0.252    0.748    = ~xP^3
t3    0.252    0.748    0.2496   0.7504   = ~xP^4
...
t∞    0.25     0.75     0.25     0.75     = ~xP^∞

PageRank vector = ~π = (π1, π2) = (0.25, 0.75)

Pt(d1) = Pt−1(d1) ∗ P11 + Pt−1(d2) ∗ P21
Pt(d2) = Pt−1(d1) ∗ P12 + Pt−1(d2) ∗ P22

354
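The iteration on this slide can be sketched in a few lines of Python. The function name and the iteration count are illustrative choices, not part of the lecture material:

```python
def power_method(P, x, steps=50):
    """Repeatedly compute x <- xP (row vector times transition
    matrix); for an ergodic chain this converges to ~pi."""
    n = len(P)
    for _ in range(steps):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x

# The 2x2 example from the slide, starting from ~x = (0, 1):
P = [[0.1, 0.9],
     [0.3, 0.7]]
pi = power_method(P, [0.0, 1.0])
print([round(v, 2) for v in pi])  # → [0.25, 0.75]
```

After a handful of steps the vector stops changing, matching the t∞ row of the table.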


PageRank summary

Preprocessing

Given graph of links, build initial matrix P

Ensure all rows sum to 1.0 to update P (for nodes with no outgoing links use 1/N for each element)
Apply teleportation with parameter α
From modified matrix, compute ~π
~πi is the PageRank of page i.

Query processing

Retrieve pages satisfying the query
Rank them by their PageRank (or at least a combination of PageRank and the relevance score)
Return reranked list to the user

355
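The preprocessing steps above can be sketched as follows. The function name and the adjacency-list input format are assumptions made for illustration:

```python
def build_matrix(out_links, n, alpha=0.14):
    """Preprocessing sketch: spread each page's probability over its
    out-links (1/N for dangling pages), then apply teleportation so
    every row of the resulting matrix still sums to 1.0."""
    P = []
    for i in range(n):
        out = out_links.get(i, [])
        if out:
            row = [1.0 / len(out) if j in out else 0.0 for j in range(n)]
        else:                       # no outgoing links: use 1/N everywhere
            row = [1.0 / n] * n
        # teleportation: with probability alpha, jump to a random page
        P.append([(1 - alpha) * p + alpha / n for p in row])
    return P

# Every row of the modified matrix sums to 1.0:
P = build_matrix({0: [1], 1: []}, 2)
print([round(sum(row), 2) for row in P])  # → [1.0, 1.0]
```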


PageRank issues

Real surfers are not random surfers.

Examples of non-random surfing: back button, short vs. long paths, bookmarks, directories – and search!
→ Markov model is not a good model of surfing.
But it’s good enough as a model for our purposes.

Simple PageRank ranking (as described on the previous slide) produces bad results for many pages.

Consider the query [video service]
The Yahoo home page (i) has a very high PageRank and (ii) contains both video and service.
If we rank all Boolean hits according to PageRank, then the Yahoo home page would be top-ranked.
Clearly not desirable.

In practice: rank according to a weighted combination of raw text match, anchor text match, PageRank & other factors

356
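One minimal sketch of such a combination follows; the linear form and the weight are illustrative assumptions (production rankers combine many more signals in more elaborate ways):

```python
def combined_score(text_score, pagerank, w=0.7):
    """Hypothetical linear blend of a query-dependent text-match
    score and the query-independent PageRank; w is an assumption."""
    return w * text_score + (1 - w) * pagerank

# A highly relevant low-PageRank page can now outrank the Yahoo
# home page for [video service], which is the behaviour we want:
yahoo = combined_score(text_score=0.2, pagerank=0.9)
niche = combined_score(text_score=0.9, pagerank=0.1)
print(niche > yahoo)  # → True
```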


Example web graph

357


Transition (probability) matrix

     d0    d1    d2    d3    d4    d5    d6
d0   0.00  0.00  1.00  0.00  0.00  0.00  0.00
d1   0.00  0.50  0.50  0.00  0.00  0.00  0.00
d2   0.33  0.00  0.33  0.33  0.00  0.00  0.00
d3   0.00  0.00  0.00  0.50  0.50  0.00  0.00
d4   0.00  0.00  0.00  0.00  0.00  0.00  1.00
d5   0.00  0.00  0.00  0.00  0.00  0.50  0.50
d6   0.00  0.00  0.00  0.33  0.33  0.00  0.33

358


Transition matrix with teleporting (α = 0.14)

     d0    d1    d2    d3    d4    d5    d6
d0   0.02  0.02  0.88  0.02  0.02  0.02  0.02
d1   0.02  0.45  0.45  0.02  0.02  0.02  0.02
d2   0.31  0.02  0.31  0.31  0.02  0.02  0.02
d3   0.02  0.02  0.02  0.45  0.45  0.02  0.02
d4   0.02  0.02  0.02  0.02  0.02  0.02  0.88
d5   0.02  0.02  0.02  0.02  0.02  0.45  0.45
d6   0.02  0.02  0.02  0.31  0.31  0.02  0.31

359
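Any entry of this matrix can be checked by hand from the raw transition matrix: each probability p becomes (1 − α) · p + α/N. For example, for row d1 (which links to d1 and d2):

```python
# d1 links to d1 and d2, so its raw row is [0, 0.5, 0.5, 0, 0, 0, 0].
alpha, n = 0.14, 7
raw = [0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
teleported = [round((1 - alpha) * p + alpha / n, 2) for p in raw]
print(teleported)  # → [0.02, 0.45, 0.45, 0.02, 0.02, 0.02, 0.02]
```

This reproduces the d1 row of the teleportation-adjusted matrix above.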


Power method vectors ~xPk

     ~x    ~xP1  ~xP2  ~xP3  ~xP4  ~xP5  ~xP6  ~xP7  ~xP8  ~xP9  ~xP10 ~xP11 ~xP12 ~xP13
d0   0.14  0.06  0.09  0.07  0.07  0.06  0.06  0.06  0.06  0.05  0.05  0.05  0.05  0.05
d1   0.14  0.08  0.06  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04
d2   0.14  0.25  0.18  0.17  0.15  0.14  0.13  0.12  0.12  0.12  0.12  0.11  0.11  0.11
d3   0.14  0.16  0.23  0.24  0.24  0.24  0.24  0.25  0.25  0.25  0.25  0.25  0.25  0.25
d4   0.14  0.12  0.16  0.19  0.19  0.20  0.21  0.21  0.21  0.21  0.21  0.21  0.21  0.21
d5   0.14  0.08  0.06  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04
d6   0.14  0.25  0.23  0.25  0.27  0.28  0.29  0.29  0.30  0.30  0.30  0.30  0.31  0.31

360
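This run can be reproduced end-to-end. The adjacency list below is read off the raw transition matrix two slides back, and the iteration count is an arbitrary choice comfortably past convergence:

```python
alpha, n = 0.14, 7
out = {0: [2], 1: [1, 2], 2: [0, 2, 3], 3: [3, 4],
       4: [6], 5: [5, 6], 6: [3, 4, 6]}
# Teleportation-adjusted transition matrix, built exactly (no rounding):
P = [[(1 - alpha) * (1 / len(out[i]) if j in out[i] else 0) + alpha / n
      for j in range(n)] for i in range(n)]
x = [1 / n] * n                     # uniform start, as in the ~x column
for _ in range(100):                # x <- xP until convergence
    x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
print([round(v, 2) for v in x])     # → [0.05, 0.04, 0.11, 0.25, 0.21, 0.04, 0.31]
```

The result agrees with the ~xP13 column of the table above.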


Example web graph

PageRank

d0   0.05
d1   0.04
d2   0.11
d3   0.25
d4   0.21
d5   0.04
d6   0.31

PageRank(d2) < PageRank(d6): why?

361


How important is PageRank?

Frequent claim: PageRank is the most important component of web ranking. The reality:

There are several components that are at least as important: e.g., anchor text, phrases, proximity, tiered indexes . . .

Rumour has it that PageRank in its original form (as presented here) now has a negligible impact on ranking

However, variants of a page’s PageRank are still an essential part of ranking.

Google’s official description of PageRank:

“PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.”

Addressing link spam is difficult and crucial.

362


Overview

1 Recap

2 Anchor text

3 PageRank

4 Wrap up


Link Analysis

PageRank is topic independent

We also need to incorporate topicality (i.e. relevance)

There is a version called Topic Sensitive PageRank

And also Hyperlink-Induced Topic Search (HITS)

363


Take Home Messages

Anchor text is a useful descriptor of the page it refers to

Links can be used as another useful retrieval signal - one indicating authority

PageRank can be viewed as the stationary distribution of a Markov chain

Power iteration is one simple method of calculating the stationary distribution

Topic sensitive variants exist

364


Reading

MRS Chapter 21, excluding 21.3.3.

365