Random walks, eigenvectors, and their applications to Information Retrieval, Natural Language Processing, and Machine Learning. Dragomir R. Radev, University of Michigan. Guest lecture in SI 614, March 7, 2006.

Post on 21-Dec-2015




INTRODUCTION


Social networks

• Induced by a relation
• Symmetric or not
• Examples:
– Friendship networks
– Board membership
– Citations
– Power grid of the US
– WWW


Prestige and centrality

• Degree centrality: how many neighbors each node has.

• Closeness centrality: how close a node is to all of the other nodes

• Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes

• Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.

• Prestige = same as centrality but for directed graphs.


MARKOV CHAINS AND RANDOM WALKS


1-d random walks

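• Drunkard’s walk:
– Start at position 0 on a line

• What is the prob. of reaching 0 before reaching 5? Same for penny matching.

• Harmonic functions: with P(x) the probability of reaching N before 0 when starting at x (so the probability of reaching 0 first is 1 - P(x)),
– P(0) = 0
– P(N) = 1
– P(x) = 1/2 P(x-1) + 1/2 P(x+1), for 0 < x < N

[Figure: number line with positions 0, 1, 2, 3, 4, 5]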
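The boundary values and the averaging rule above determine P(x) uniquely. As a minimal sketch, assuming a simple relaxation solver (repeatedly replace each interior value by the average of its neighbors), the function name `hitting_probabilities` and the tolerance are illustrative choices rather than anything from the lecture:

```python
def hitting_probabilities(n, sweeps=10_000, tol=1e-12):
    """Solve P(0)=0, P(n)=1, P(x) = (P(x-1) + P(x+1)) / 2 by relaxation."""
    p = [0.0] * (n + 1)
    p[n] = 1.0  # boundary condition P(N) = 1; P(0) stays 0
    for _ in range(sweeps):
        delta = 0.0
        for x in range(1, n):
            new = 0.5 * p[x - 1] + 0.5 * p[x + 1]
            delta = max(delta, abs(new - p[x]))
            p[x] = new
        if delta < tol:  # stop once a full sweep changes nothing noticeable
            break
    return p


probs = hitting_probabilities(5)
```

For N = 5 the values settle on the straight line P(x) = x/5, which is the only function that meets both boundary conditions and the averaging property.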


Graph-based representations

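[Figure: graph G (V,E) on nodes 1 to 8, shown next to its square connectivity (incidence) matrix with rows and columns 1 to 8. Rows 1, 3, and 6 have two nonzero entries, rows 2 and 4 have one, row 5 has four, and rows 7 and 8 have none; the column positions did not survive extraction.]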


Markov chains

• A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.

• Path = sequence (x0, x1, …, xn), where xi = xi-1*E

• The probability of a path can be computed as a product of probabilities for each step i.

• Random walk = find xj given x0, E, and j.
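The update rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: the two-state kernel `E` below is made up for the example, and `step` and `walk` are hypothetical helper names.

```python
def step(x, E):
    """One step of the chain: the row vector x multiplied by the kernel E."""
    n = len(E)
    return [sum(x[i] * E[i][j] for i in range(n)) for j in range(n)]


def walk(x0, E, j):
    """Distribution after j steps, starting from the distribution x0."""
    x = list(x0)
    for _ in range(j):
        x = step(x, E)
    return x


# Illustrative row-stochastic kernel (each row sums to 1).
E = [[0.9, 0.1],
     [0.5, 0.5]]
x2 = walk([1.0, 0.0], E, 2)
```

Starting in state 0, one step gives (0.9, 0.1) and a second step gives (0.86, 0.14); each intermediate vector is again a probability distribution.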


Stationary solutions

• The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
– E is stochastic
– E is irreducible
– E is aperiodic

• To make these conditions true:
– All rows of E add up to 1 (and no value is negative)
– Make sure that E is strongly connected
– Make sure that E is not bipartite

• Example: PageRank [Brin and Page 1998]: use “teleportation”


Example

[Figure: example graph on nodes 1 to 8]

This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph.

[Figure: bar charts of PageRank (vertical axis, 0 to 1) for nodes 1 to 8 at t=0 and t=1]


EIGENVALUES AND EIGENVECTORS


Eigenvectors and eigenvalues

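• An eigenvector is an implicit “direction” for a matrix:

$A v = \lambda v$

where v (eigenvector) is non-zero, though λ (eigenvalue) can be any complex number in principle

• Computing eigenvalues:

$\det(A - \lambda I) = 0$

• Example (entries as they survive in the transcript; any minus signs would not have survived extraction):

$A = \begin{pmatrix} 1 & 3 \\ 2 & 0 \end{pmatrix}$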


Stochastic matrices

• Stochastic matrices: each row (or column) adds up to 1 and no value is less than 0. Example:

$A = \begin{pmatrix} 3/8 & 5/8 \\ 1/4 & 3/4 \end{pmatrix}$

• The largest eigenvalue of a stochastic matrix E is real: λ1 = 1.

• For λ1, the left (principal) eigenvector is p, the right eigenvector is the all-ones vector 1

• In other words, $E^T p = p$.
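These eigenvector claims are easy to check on a small case. The sketch below assumes a 2×2 row-stochastic matrix with rows (3/8, 5/8) and (1/4, 3/4), taken here as illustrative values, and uses the closed form for a two-state chain; `stationary_2x2` and `left_multiply` are hypothetical helper names.

```python
from fractions import Fraction as F

# Illustrative row-stochastic matrix (assumed values; each row sums to 1).
E = [[F(3, 8), F(5, 8)],
     [F(1, 4), F(3, 4)]]


def stationary_2x2(E):
    """Closed form for a two-state chain with rows [1-a, a] and [b, 1-b]."""
    a, b = E[0][1], E[1][0]
    return [b / (a + b), a / (a + b)]


def left_multiply(p, E):
    """Row vector p times E, i.e. E^T p written as a row."""
    n = len(E)
    return [sum(p[i] * E[i][j] for i in range(n)) for j in range(n)]


p = stationary_2x2(E)                # left eigenvector for eigenvalue 1
row_sums = [sum(row) for row in E]   # E times the all-ones vector
```

With these values p = (2/7, 5/7), multiplying p by E returns p unchanged, and E applied to the all-ones vector returns the all-ones vector, which is the eigenvalue-1 statement for the left and right eigenvectors respectively.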


Computing the stationary distribution

Solution for the stationary distribution:

$p = E^T p$

$(I - E^T) p = 0$

function PowerStatDist (E):
begin
  p(0) = u; (or p(0) = [1,0,…0])
  i = 1;
  repeat
    p(i) = E^T p(i-1)
    L = ||p(i) - p(i-1)||_1;
    i = i + 1;
  until L < ε
  return p(i-1)
end
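The PowerStatDist pseudocode translates almost line by line into Python. The version below is a sketch under a few assumptions: the kernel is a dense list of rows, the start vector is uniform, and the 3-state matrix is an invented example chosen so the answer can be checked by hand.

```python
def power_stat_dist(E, eps=1e-10, max_iter=10_000):
    """Power method: iterate p <- E^T p until the L1 change drops below eps."""
    n = len(E)
    p = [1.0 / n] * n  # p(0) = uniform vector u
    for _ in range(max_iter):
        # (E^T p)[j] = sum over i of E[i][j] * p[i]
        new_p = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        diff = sum(abs(a - b) for a, b in zip(new_p, p))  # L1 norm of the change
        p = new_p
        if diff < eps:
            break
    return p


# Illustrative kernel: stochastic, irreducible, aperiodic (state 1 has a self-loop).
E = [[0.0, 0.5, 0.5],
     [0.25, 0.5, 0.25],
     [0.5, 0.5, 0.0]]
p = power_stat_dist(E)
```

For this kernel the stationary distribution is (0.25, 0.5, 0.25), which can be confirmed by substituting it into p = E^T p. The `max_iter` cap is a safety net for kernels that violate the convergence conditions.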


Example

[Figure: example graph on nodes 1 to 8, with bar charts of PageRank (vertical axis, 0 to 1) for nodes 1 to 8 at t=0, t=1, and t=10]


PAGERANK AND HITS


PageRank

• Named after Larry Page, co-founder of Google (and U-M graduate).

• Imagine a random walk on a strongly connected Web graph.

• Aimless surfer will reach any page after a high number of steps.

• Visiting “prestigious pages” increases the speed of convergence.


Prestige

• Adjacency matrix E where E[i, j]=1 if document i cites document j.

• Every node has a prestige value p[v]

$p' = E^T p$

$p'[v] = \sum_u E^T[v,u] \, p[u] = \sum_u E[u,v] \, p[u]$


PageRank

• Described in “The anatomy of a large-scale hypertextual web search engine” by Brin and Page (WWW 1998)

• Independent of query (although more recent work by Haveliwala (WWW 2002) has also introduced a topic-based PageRank).


Co-citation

• If document u cites both v and w, then v and w are co-cited.

$(E^T E)[v,w] = \sum_u E^T[v,u] \, E[u,w] = \sum_u E[u,v] \, E[u,w] = |\{u : (u,v) \in E, (u,w) \in E\}|$

• The entry [v,w] in the $(E^T E)$ matrix is the co-citation index of v and w.
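A small worked case makes the co-citation count concrete. The citation matrix below is invented for illustration (E[u][v] = 1 if document u cites document v), and `cocitation` is a hypothetical helper that forms the product of E transposed with E directly.

```python
def cocitation(E):
    """Return E^T E: entry [v][w] counts documents u that cite both v and w."""
    n = len(E)
    return [[sum(E[u][v] * E[u][w] for u in range(n)) for w in range(n)]
            for v in range(n)]


# Illustrative citations: doc 0 cites 1 and 2; doc 1 cites 2 and 3; doc 3 cites 1 and 2.
E = [[0, 1, 1, 0],
     [0, 0, 1, 1],
     [0, 0, 0, 0],
     [0, 1, 1, 0]]
C = cocitation(E)
```

Here documents 1 and 2 are co-cited twice (by documents 0 and 3), documents 2 and 3 once (by document 1), and documents 1 and 3 never. The diagonal entry C[v][v] is simply the number of documents that cite v.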


HITS

• Query-dependent model (Kleinberg 97)
• Hubs and authorities (e.g., cars, Honda)

• Algorithm
– obtain root set using input query
– expand the root set by radius one
– run iterations on the hub and authority scores together: $a' = E^T h$, $h' = E a$
– report top-ranking authorities and hubs

• Currently used in Teoma
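The iteration step of HITS can be sketched as follows. Assumptions are labeled: the link matrix is a toy example rather than a query-derived root set, scores are renormalized to unit length after each round (a common choice to keep them bounded), and `hits` and `normalize` are hypothetical names.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def hits(E, iterations=50):
    """Alternate a' = E^T h and h' = E a on adjacency matrix E."""
    n = len(E)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        a = normalize([sum(E[u][v] * h[u] for u in range(n)) for v in range(n)])
        h = normalize([sum(E[u][v] * a[v] for v in range(n)) for u in range(n)])
    return a, h


# Illustrative links: page 0 points to 2 and 3; page 1 points to 2.
E = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
authority, hub = hits(E)
```

In this toy graph page 2 ends up as the strongest authority because both hubs point to it, and page 0 as the strongest hub because it points to both authorities.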


Some pointers

• http://jung.sourceforge.net/applet/rankingdemo.html

• Highest pagerank scores: http://en.wikipedia.org/wiki/List_of_websites_with_a_high_PageRank

• http://www.pagerank.dk/
• http://www.scriptet.com/improve-pagerank.html
• http://en.wikipedia.org/wiki/Page_rank


LEXICAL CENTRALITY

Erkan and Radev 2004


Centrality in summarization

• Extractive summarization (pick k sentences that are most representative of a collection of n sentences)

• Motivation: capture the most central words in a document or cluster

• Centroid score [Radev & al. 2000, 2004a]
• Alternative methods for computing centrality?


Sample multidocument cluster

1 (d1s1) Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met.

2 (d2s1) Iraqi Vice president Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990.

3 (d2s2) Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it."

4 (d2s3) Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation.

5 (d3s1) The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area.

6 (d3s2) Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, ``will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region.''

7 (d3s3) Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM).

8 (d4s1) The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors.

9 (d5s1) British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq ``did not end'' and that Britain is still ``ready, prepared, and able to strike Iraq.''

10 (d5s2) In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq ``will not end until Iraq has absolutely and unconditionally respected its commitments'' towards the United Nations.

11 (d5s3) A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.

(DUC cluster d1003t)


Cosine between sentences

• Let s1 and s2 be two sentences.

• Let x and y be their representations in an n-dimensional vector space

• The cosine between x and y is then computed based on the inner product of the two:

$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\|x\| \, \|y\|}$

• For non-negative term weights, the cosine ranges from 0 to 1.
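As a minimal sketch of the computation, the function below represents each sentence as a bag of raw word counts. That weighting is an assumption made for brevity (a tf-idf weighting would slot into the same formula), and the whitespace tokenizer is deliberately naive.

```python
import math
from collections import Counter


def cosine(s1, s2):
    """Cosine between two sentences represented as raw term-count vectors."""
    x = Counter(s1.lower().split())
    y = Counter(s2.lower().split())
    dot = sum(x[w] * y[w] for w in x)  # Counter returns 0 for missing words
    norm = (math.sqrt(sum(c * c for c in x.values()))
            * math.sqrt(sum(c * c for c in y.values())))
    return dot / norm if norm else 0.0


same = cosine("Iraq refuses to cooperate", "Iraq refuses to cooperate")
half = cosine("the cat", "the dog")
none = cosine("the cat", "a dog")
```

Identical sentences score 1, sentences with no shared words score 0, and "the cat" against "the dog" scores 1/2 because one of the two words in each sentence is shared.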


LexRank (Cosine centrality)

1 2 3 4 5 6 7 8 9 10 11

1 1.00 0.45 0.02 0.17 0.03 0.22 0.03 0.28 0.06 0.06 0.00

2 0.45 1.00 0.16 0.27 0.03 0.19 0.03 0.21 0.03 0.15 0.00

3 0.02 0.16 1.00 0.03 0.00 0.01 0.03 0.04 0.00 0.01 0.00

4 0.17 0.27 0.03 1.00 0.01 0.16 0.28 0.17 0.00 0.09 0.01

5 0.03 0.03 0.00 0.01 1.00 0.29 0.05 0.15 0.20 0.04 0.18

6 0.22 0.19 0.01 0.16 0.29 1.00 0.05 0.29 0.04 0.20 0.03

7 0.03 0.03 0.03 0.28 0.05 0.05 1.00 0.06 0.00 0.00 0.01

8 0.28 0.21 0.04 0.17 0.15 0.29 0.06 1.00 0.25 0.20 0.17

9 0.06 0.03 0.00 0.00 0.20 0.04 0.00 0.25 1.00 0.26 0.38

10 0.06 0.15 0.01 0.09 0.04 0.20 0.00 0.20 0.26 1.00 0.12

11 0.00 0.00 0.00 0.01 0.18 0.03 0.01 0.17 0.38 0.12 1.00


Lexical centrality (t=0.3)

[Figure: similarity graph over the 11 sentences (d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3) at cosine threshold t=0.3]


Lexical centrality (t=0.2)

[Figure: similarity graph over the same 11 sentences at cosine threshold t=0.2]


Lexical centrality (t=0.1)

[Figure: similarity graph over the same 11 sentences at cosine threshold t=0.1, with d4s1, d3s2, and d2s1 called out]

Sentences vote for the most central sentence! Need to worry about diversity reranking.
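The thresholded graphs can be rebuilt from the cosine matrix on the LexRank (cosine centrality) slide. The sketch below copies that matrix and counts, for each sentence, how many other sentences exceed the threshold. Two choices here are assumptions: the comparison is strict (cosine > t) and a sentence is not counted as its own neighbor.

```python
IDS = ["d1s1", "d2s1", "d2s2", "d2s3", "d3s1", "d3s2",
       "d3s3", "d4s1", "d5s1", "d5s2", "d5s3"]

# Pairwise cosine values, copied from the cosine centrality table.
SIM = [
    [1.00, 0.45, 0.02, 0.17, 0.03, 0.22, 0.03, 0.28, 0.06, 0.06, 0.00],
    [0.45, 1.00, 0.16, 0.27, 0.03, 0.19, 0.03, 0.21, 0.03, 0.15, 0.00],
    [0.02, 0.16, 1.00, 0.03, 0.00, 0.01, 0.03, 0.04, 0.00, 0.01, 0.00],
    [0.17, 0.27, 0.03, 1.00, 0.01, 0.16, 0.28, 0.17, 0.00, 0.09, 0.01],
    [0.03, 0.03, 0.00, 0.01, 1.00, 0.29, 0.05, 0.15, 0.20, 0.04, 0.18],
    [0.22, 0.19, 0.01, 0.16, 0.29, 1.00, 0.05, 0.29, 0.04, 0.20, 0.03],
    [0.03, 0.03, 0.03, 0.28, 0.05, 0.05, 1.00, 0.06, 0.00, 0.00, 0.01],
    [0.28, 0.21, 0.04, 0.17, 0.15, 0.29, 0.06, 1.00, 0.25, 0.20, 0.17],
    [0.06, 0.03, 0.00, 0.00, 0.20, 0.04, 0.00, 0.25, 1.00, 0.26, 0.38],
    [0.06, 0.15, 0.01, 0.09, 0.04, 0.20, 0.00, 0.20, 0.26, 1.00, 0.12],
    [0.00, 0.00, 0.00, 0.01, 0.18, 0.03, 0.01, 0.17, 0.38, 0.12, 1.00],
]


def degree_centrality(sim, ids, t):
    """Number of other sentences whose cosine with each sentence exceeds t."""
    n = len(ids)
    return {ids[i]: sum(1 for j in range(n) if j != i and sim[i][j] > t)
            for i in range(n)}


deg_01 = degree_centrality(SIM, IDS, 0.1)
deg_03 = degree_centrality(SIM, IDS, 0.3)
most_central = max(deg_01, key=deg_01.get)
```

At t = 0.1 sentence d4s1 has the most neighbors (eight), in line with it being the top-voted sentence. At t = 0.3 only two pairs remain connected (d1s1 with d2s1, and d5s1 with d5s3), which shows how a high threshold discards most of the similarity information.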


LexRank

$p(A) = \frac{d}{N} + (1 - d)\left(\frac{p(T_1)}{c(T_1)} + \dots + \frac{p(T_n)}{c(T_n)}\right)$

• T1…Tn are pages that link to A, c(Ti) is the outdegree of page Ti, and N is the total number of pages.

• d is the “damping factor”, or the probability that we “jump” to a far-away node during the random walk. It accounts for disconnected components or periodic graphs.

• When d = 1, we have a strict uniform distribution. When d = 0, the method is not guaranteed to converge to a unique solution.

• Typical values for d lie in [0.1, 0.2] (Brin and Page, 1998).
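The following sketch iterates the damped update node by node. It takes d to be the jump probability, so each page receives d/N plus a (1 - d) share of the score of every page linking to it, divided by that page's outdegree. The three-page graph and the name `damped_rank` are illustrative, and every page is assumed to have at least one out-link.

```python
def damped_rank(out_links, d=0.15, eps=1e-12, max_iter=1000):
    """Iterate p(A) = d/N + (1-d) * sum of p(T)/c(T) over pages T linking to A."""
    nodes = list(out_links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new_p = {v: d / n for v in nodes}  # jump term
        for t, targets in out_links.items():
            share = (1 - d) * p[t] / len(targets)  # c(T) = outdegree of T
            for a in targets:
                new_p[a] += share
        diff = sum(abs(new_p[v] - p[v]) for v in nodes)
        p = new_p
        if diff < eps:
            break
    return p


# Illustrative graph: A -> B, B -> A and C, C -> A.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
ranks = damped_rank(graph)
uniform = damped_rank(graph, d=1.0)
```

With d = 1 every page gets exactly 1/N, the uniform distribution. With a small d the scores follow the link structure, so page A, which is linked from both other pages, outranks page C.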


LexRank demo


BIASED LEXRANK

Otterbacher, Erkan and Radev 2005


A small plane has hit a skyscraper in central Milan, setting the top floors of the 30-story building on fire, an Italian journalist told CNN. The crash by the Piper tourist plane into the 26th floor occurred at 5:50 p.m. (1450 GMT) on Thursday, said journalist Desideria Cavina. The building houses government offices and is next to the city's central train station. Several storeys of the building were engulfed in fire, she said. Italian TV says the crash put a hole in the 25th floor of the Pirelli building, and that smoke is pouring from the opening. Police and ambulances are at the scene. Many people were on the streets as they left work for the evening at the time of the crash. Police were trying to keep people away, and many ambulances were on the scene. There is no word yet on casualties.

CNN 4/18/02 12:22pm; CNN 4/18/02 12:32pm; ABCNews 4/18/02 1:00pm; MSNBC 4/18/02 1:00pm; La Stampa 4/18/02 12:45pm

A small plane has hit a skyscraper in central Milan, setting the top floors of the 30-story building on fire, an Italian journalist told CNN. The crash by the Piper tourist plane into the 26th floor occurred at 5:50 p.m. (1450 GMT) on Thursday, said journalist Desideria Cavina. The building houses government offices and is next to the city's central train station. Several storeys of the building were engulfed in fire, she said. Italian TV showed a hole in the side of the Pirelli building with smoke pouring from the opening. RAI state TV reported that the plane had apparently radioed an SOS because of engine trouble. Earlier though, in Rome, the senate's president, Marcello Pera, said it "very probably" appeared to be a terrorist attack. Police and ambulances are at the scene. Many people were on the streets as they left work for the evening at the time of the crash. Police were trying to keep people away, and many ambulances were on the scene. There is no word yet on casualties. TV pictures from the scene evoked horrific memories of the September 11 attacks on the World Trade Center in New York and the collapse of the building's twin towers. "I heard a strange bang so I went to the window and outside I saw the windows of the Pirelli building blown out and then I saw smoke coming from them," said Gianluca Liberto, an engineer who was working in the area told Reuters. The building is known as the Pirelli skyscraper but the Italian tyre and cable company does not operate out of the building. It is one of the symbols of Italy's financial capital and is one of the world's tallest concrete buildings, designed between 1955 and 1960.

A small plane crashed into a skyscraper in downtown Milan today, setting several floors of the 30-story building on fire. The plane crashed into the 25th floor of the Pirelli building in downtown Milan. The weather was clear at the time of the crash. Smoke poured from the opening as police and ambulances rushed to the area. The president of the Italian Senate, Marcello Pera, told Italian television it "very probably" appeared to be a terrorist attack but soon afterwards his spokesman said it was probably an accident. A transport official told Reuters the plane had reported problems with its undercarriage and was circling the city ahead of trying to land at a local airport. The Pirelli building houses the administrative offices of the local Lombardy region and sits next to the city's central train station. It is constructed of concrete and glass. The crash happened just before rush hour, as office workers were closing their day.

A small airplane crashed into a government building in heart of Milan, setting the top floors on fire, Italian police reported. There were no immediate reports on casualties as rescue workers attempted to clear the area in the city’s financial district. Few details of the crash were available, but news reports about it immediately set off fears that it might be a terrorist act akin to the Sept. 11 attacks in the United States. Those fears sent U.S. stocks tumbling to session lows in late morning trading. Witnesses reported hearing a loud explosion from the 30-story office building, which houses the administrative offices of the local Lombardy region and sits next to the city's central train station. Italian state television said the crash put a hole in the 25th floor of the Pirelli building. News reports said smoke poured from the opening. Police and ambulances rushed to the building in downtown Milan. No further details were immediately available.

A tourist plane, a Piper, crashed this afternoon in Milan, shortly before 6 p.m., into the Pirelli skyscraper, which also houses the Lombardy Region (the president of the Region, Roberto Formigoni, is on an official mission in India with a regional delegation). This was learned from investigative sources. The impact reportedly occurred around the 25th of the skyscraper's 30 floors. At least six floors visibly appear gutted. Debris was thrown by the explosion some forty meters around the building. Throughout the area around the Pirelli skyscraper, telephone communications, including by mobile phone, are interrupted or nearly impossible. The stock exchange suspended the evening session at Piazza Affari after the crash of the tourist plane; President Bush was also immediately informed of the explosion at the Pirellone. «In all probability it is an attack». So said Marcello Pera, opening the session at Palazzo Madama. But according to what has been learned, the tourist plane was probably in trouble: the pilot reportedly sent an SOS, picked up by the Linate control tower.


Questions from the Milan cluster

1. How many people were injured?
2. How many people were killed? (age, number, gender, description)
3. Was the pilot killed?
4. Where was the plane coming from?
5. Was it an accident (technical problem, illness, terrorist act)?
6. Who was the pilot? (age, number, gender, description)
7. When did the plane crash?
8. How tall is the Pirelli building?
9. Who was on the plane with the pilot?
10. Did the plane catch fire before hitting the building?
11. What was the weather like at the time of the crash?
12. When was the building built?
13. What direction was the plane flying?
14. How many people work in the building?
15. How many people were in the building at the time of the crash?
16. How many people were taken to the hospital?
17. What kind of aircraft was used?


Protein Regulatory Network Recognition

• Wnt signaling
• Glycogen synthase kinase-3 (GSK-3) and CK1 (casein kinase 1) alpha phosphorylate Arm (Armadillo, β-catenin) and cause it to degrade.

• Axin also binds to the phosphatase PP2A

• PP2A activity inhibits Wnt signaling

Hsu 1999, Li 2001, Yanagawa 2002, Liu 2002, Nusse 2003


Biased LexRank

• Diversity-based summaries (cf. Carbonell & Goldstein)
• Query-based summaries: Given: a cluster of documents + a set of sample sentences that express certain facts (e.g., protein interactions or answers to questions like “What type of aircraft was involved?”)


Question-focused sentence retrieval


Example


RW METHODS FOR CLASSIFICATION

Radev 2004


PP attachment

• High vs. low attachment

V x02_join x01_board x0_as x11_director
N x02_is x01_chairman x0_of x11_entitynam
N x02_name x01_director x0_of x11_conglomer
N x02_caus x01_percentag x0_of x11_death
V x02_us x01_crocidolit x0_in x11_filter
V x02_bring x01_attent x0_to x11_problem

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group. Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a nonexecutive director of this British industrial conglomerate.

A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago , researchers reported . The asbestos fiber , crocidolite , is unusually resilient once it enters the lungs , with even brief exposures to it causing symptoms that show up decades later , researchers said . Lorillard Inc. , the unit of New York-based Loews Corp. that makes Kent cigarettes , stopped using crocidolite in its Micronite cigarette filters in 1956 . Although preliminary findings were reported more than a year ago , the latest results appear in today 's New England Journal of Medicine , a forum likely to bring new attention to the problem .


Electrical networks and random walks

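• Ergodic (connected) Markov chain with transition matrix P

$C_{xy} = \frac{1}{R_{xy}}$, $C_x = \sum_y C_{xy}$, $P_{xy} = \frac{C_{xy}}{C_x}$

[Figure: circuit on nodes a, b, c, d with three 1 Ω resistors and two 0.5 Ω resistors; the matrix below corresponds to 1 Ω on a–c, a–d, and b–c, and 0.5 Ω on b–d and c–d]

With rows and columns ordered a, b, c, d:

$P = \begin{pmatrix} 0 & 0 & \frac{1}{2} & \frac{1}{2} \\ 0 & 0 & \frac{1}{3} & \frac{2}{3} \\ \frac{1}{4} & \frac{1}{4} & 0 & \frac{1}{2} \\ \frac{1}{5} & \frac{2}{5} & \frac{2}{5} & 0 \end{pmatrix}$

$w = P^T w$, $w = \left(\frac{2}{14}, \frac{3}{14}, \frac{4}{14}, \frac{5}{14}\right)$

From Doyle and Snell 2000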
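The step from resistances to a random walk can be reproduced with exact fractions. This sketch assumes the edge resistances 1 Ω on a–c, a–d, b–c and 0.5 Ω on b–d, c–d (the assignment implied by the transition matrix), and the helper names are hypothetical.

```python
from fractions import Fraction as F

# Assumed edge resistances in ohms for the four-node circuit.
RESISTANCE = {("a", "c"): F(1), ("a", "d"): F(1), ("b", "c"): F(1),
              ("b", "d"): F(1, 2), ("c", "d"): F(1, 2)}


def conductances(resistance):
    """Symmetric conductance table C[x][y] = 1 / R_xy."""
    C = {}
    for (x, y), r in resistance.items():
        C.setdefault(x, {})[y] = 1 / r
        C.setdefault(y, {})[x] = 1 / r
    return C


def transition_matrix(C):
    """P[x][y] = C_xy / C_x, where C_x is the total conductance at x."""
    return {x: {y: c / sum(nb.values()) for y, c in nb.items()}
            for x, nb in C.items()}


def stationary(C):
    """w_x = C_x / (sum of all C_x), the stationary distribution of the walk."""
    total = sum(sum(nb.values()) for nb in C.values())
    return {x: sum(nb.values()) / total for x, nb in C.items()}


C = conductances(RESISTANCE)
P = transition_matrix(C)
w = stationary(C)
```

The total conductances come out as 2, 3, 4, and 5 for a, b, c, and d, so w = (2/14, 3/14, 4/14, 5/14), and the walk from d moves to a, b, c with probabilities 1/5, 2/5, 2/5.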


Electrical networks and random walks

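$i_{xy} = \frac{v_x - v_y}{R_{xy}} = (v_x - v_y) \, C_{xy}$

$\sum_y i_{xy} = 0$ for $x \neq a, b$

$v_x = \sum_y \frac{C_{xy}}{C_x} v_y = \sum_y P_{xy} v_y$

$v_a = 1$, $v_b = 0$

[Figure: the same circuit on nodes a, b, c, d (resistors of 1 Ω, 1 Ω, 1 Ω, 0.5 Ω, 0.5 Ω) with a 1 V source across a and b]

• vx is the probability that a random walk starting at x will reach a before reaching b.

• The random walk interpretation allows us to use Monte Carlo methods to solve electrical circuits.

$v_c = \frac{1}{4} + \frac{1}{2} v_d$, $v_d = \frac{1}{5} + \frac{2}{5} v_c$, so $v_c = \frac{7}{16}$ and $v_d = \frac{3}{8}$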
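The voltages can be found numerically by applying the averaging rule until it stops changing anything. The sketch below uses the transition probabilities of the four-node example and fixes v_a = 1, v_b = 0. The deterministic relaxation solver is an assumption chosen so the result is reproducible; a Monte Carlo estimate would instead simulate walks from c and d and count how often a is reached first.

```python
# Transition probabilities of the random walk on the example circuit.
P = {
    "a": {"c": 1 / 2, "d": 1 / 2},
    "b": {"c": 1 / 3, "d": 2 / 3},
    "c": {"a": 1 / 4, "b": 1 / 4, "d": 1 / 2},
    "d": {"a": 1 / 5, "b": 2 / 5, "c": 2 / 5},
}


def voltages(P, boundary, sweeps=1000, tol=1e-14):
    """Solve v_x = sum_y P_xy v_y at interior nodes, with fixed boundary values."""
    v = {x: boundary.get(x, 0.0) for x in P}
    for _ in range(sweeps):
        delta = 0.0
        for x in P:
            if x in boundary:
                continue  # boundary voltages stay fixed
            new = sum(pxy * v[y] for y, pxy in P[x].items())
            delta = max(delta, abs(new - v[x]))
            v[x] = new
        if delta < tol:
            break
    return v


v = voltages(P, {"a": 1.0, "b": 0.0})
```

The iteration settles at v_c = 7/16 and v_d = 3/8, which are also the probabilities that a walk started at c or d reaches a before b.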


Example

[Figure: chain of five 4-tuples with their attachment labels, in order]

reported earnings for quarter: V
reported loss for quarter: ?
posted loss for quarter: ?
posted loss of quarter: ?
posted loss of million: N*


Example

[Figure: graph with label nodes V and N, feature nodes v, n1, p, and n2, and the tuples “reported earnings for quarter”, “posted loss of million”, and “posted earnings for quarter”]


TUMBL


TUMBL


ADDITIONAL REFERENCES


• Wu and Huberman 2004. Finding communities in linear time: a physics approach. The European Physical Journal B, 38:331–338

• Kurland and Lee 2005 – random walks with generation probabilities

• Erkan 2006 – random walk based clustering
• Zhu and Ghahramani – work on ML methods on graphs
• Doyle and Snell – random walks and electric networks
• Large bibliography: http://tangra.si.umich.edu/~radev/webgraph/bibliography.pdf


EXTRA SLIDES


Cosine centrality vs. centroid centrality

ID LPR (0.1) LPR (0.2) LPR (0.3) Centroid

d1s1 0.6007 0.6944 1.0000 0.7209

d2s1 0.8466 0.7317 1.0000 0.7249

d2s2 0.3491 0.6773 1.0000 0.1356

d2s3 0.7520 0.6550 1.0000 0.5694

d3s1 0.5907 0.4344 1.0000 0.6331

d3s2 0.7993 0.8718 1.0000 0.7972

d3s3 0.3548 0.4993 1.0000 0.3328

d4s1 1.0000 1.0000 1.0000 0.9414

d5s1 0.5921 0.7399 1.0000 0.9580

d5s2 0.6910 0.6967 1.0000 1.0000

d5s3 0.5921 0.4501 1.0000 0.7902


Evaluation metrics

• Difficult to evaluate summaries
– Intrinsic vs. extrinsic evaluations
– Extractive vs. non-extractive evaluations
– Manual vs. automatic evaluations

• ROUGE = n-gram recall for different values of n.
• Example:
– Reference = “The cat in the hat”
– System = “The cat wears a top hat”
– 1-gram recall = 3/5; 2-gram recall = 1/4; 3,4-gram recall = 0

• ROUGE-W = longest common subsequence
• Example above: 3/5
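The worked example can be reproduced with a few lines of Python. This is a simplified sketch rather than the official ROUGE toolkit: it lowercases and splits on whitespace, clips repeated n-grams to the number of times they occur in the system output, and reports the plain (unweighted) longest common subsequence length.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_recall(reference, system, n):
    """Fraction of reference n-grams that also occur in the system output."""
    ref = ngrams(reference.lower().split(), n)
    hyp = ngrams(system.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum(min(count, hyp[g]) for g, count in ref.items()) / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


reference = "The cat in the hat"
system = "The cat wears a top hat"
r1 = ngram_recall(reference, system, 1)
r2 = ngram_recall(reference, system, 2)
r3 = ngram_recall(reference, system, 3)
lcs = lcs_length(reference.lower().split(), system.lower().split())
```

The unigram recall is 3/5 because the reference contains "the" twice but the system output only once; the single shared bigram is "the cat"; and the longest common subsequence is "the cat hat", three of the five reference words.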


CODE ROUGE-1 ROUGE-2 ROUGE-W

C0.5 0.39013 0.10459 0.12202

C10 0.38539 0.10125 0.11870

C1.5 0.38074 0.09922 0.11804

C1 0.38181 0.10023 0.11909

C2.5 0.37985 0.10154 0.11917

C2 0.38001 0.09901 0.11772

Degree0.5T0.1 0.39016 0.10831 0.12292

Degree0.5T0.2 0.39076 0.11026 0.12236

Degree0.5T0.3 0.38568 0.10818 0.12088

Degree1.5T0.1 0.38634 0.10882 0.12136

Degree1.5T0.2 0.39395 0.11360 0.12329

Degree1.5T0.3 0.38553 0.10683 0.12064

Degree1T0.1 0.38882 0.10812 0.12286

Degree1T0.2 0.39241 0.11298 0.12277

Degree1T0.3 0.38412 0.10568 0.11961

Lpr0.5T0.1 0.39369 0.10665 0.12287

Lpr0.5T0.2 0.38899 0.10891 0.12200

Lpr0.5t0.3 0.38667 0.10255 0.12244

Lpr1.5t0.1 0.39997 0.11030 0.12427

Lpr1.5t0.2 0.39970 0.11508 0.12422

Lpr1.5t0.3 0.38251 0.10610 0.12039

Lpr1T0.1 0.39312 0.10730 0.12274

Lpr1T0.2 0.39614 0.11266 0.12350

Lpr1T0.3 0.38777 0.10586 0.12157

Row groups: Centroid (C…), Degree (Degree…), LexPageRank (Lpr…)


Evaluation results

Centroid: C0.5, C10, C1.5, C1, C2.5, C2

Degree: D0.5T0.1, D0.5T0.2, D0.5T0.3, D1.5T0.1, D1.5T0.2, D1.5T0.3, D1T0.1, D1T0.2, D1T0.3

LexRank: Lr0.5T0.1, Lr0.5T0.2, Lr0.5t0.3, Lr1.5t0.1, Lr1.5t0.2, Lr1.5t0.3, Lr1T0.1, Lr1T0.2, Lr1T0.3

ROUGE-1: Lr1.5t0.1 0.400, Lr1.5t0.2 0.400, Lr1T0.2 0.396, …, C1 0.382

ROUGE-2: Lr1.5t0.2 0.115, D1.5T0.2 0.114, D1T0.2 0.113, …, C1.5 0.099

ROUGE-W: Lr1.5t0.1 0.124, Lr1.5t0.2 0.124, Lr1T0.2 0.124, …, C2 0.118


DUC 2004 results

Peer code Task ROUGE-1 ROUGE-2 ROUGE-3 ROUGE-4

141 3 5 2 1 1

142 3 5 1 1 1

143 4 1 2 1 1

144 4 3 1 1 1

145 4 1 2 2 2


Relevance

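$rel(s|q) = \sum_{w \in q} \log(tf_{w,s} + 1) \times \log(tf_{w,q} + 1) \times idf_w$

$idf_w = \log\left(\frac{N + 1}{0.5 + sf_w}\right)$

$p(s|q) = d \, \frac{rel(s|q)}{\sum_{z \in C} rel(z|q)} + (1 - d) \sum_{v \in C} \frac{sim(s,v)}{\sum_{z \in C} sim(z,v)} \, p(v|q)$

In matrix form: $v = [dA + (1-d)B]^T v$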
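The relevance score can be sketched directly from its definition: a sum over query words of log(tf in sentence + 1) × log(tf in query + 1) × idf, with idf_w = log((N + 1) / (0.5 + sf_w)). In this sketch N is taken to be the number of sentences in the cluster and sf_w the number of sentences containing w; sentences are pre-tokenized lists, and the toy data is invented.

```python
import math
from collections import Counter


def idf(word, sentences):
    """idf_w = log((N + 1) / (0.5 + sf_w)) over a list of tokenized sentences."""
    n = len(sentences)
    sf = sum(1 for s in sentences if word in s)
    return math.log((n + 1) / (0.5 + sf))


def relevance(sentence, query, sentences):
    """rel(s|q): sum over query words of log(tf_s+1) * log(tf_q+1) * idf."""
    tf_s = Counter(sentence)
    tf_q = Counter(query)
    return sum(math.log(tf_s[w] + 1) * math.log(tf_q[w] + 1) * idf(w, sentences)
               for w in tf_q)


# Illustrative cluster and query (tokens already lowercased).
cluster = [["plane", "hit", "building"],
           ["police", "at", "scene"],
           ["plane", "crash", "milan"]]
query = ["plane", "crash"]
scores = [relevance(s, query, cluster) for s in cluster]
```

A sentence that shares no word with the query scores exactly 0, and the third sentence outranks the first because it matches both query words. Normalizing these scores over the cluster gives the relevance term that d weights in the mixture.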


Corpus

• 20 clusters: 11+3+6

• 341 total questions

• Interjudge agreement: Kappa = 0.68 (with 2 judges) on the judgment of whether sentence X contains the answer to question Y.


TRDR Results

• TRDR = total reciprocal document rank

• Baseline: 0.867

• Mixture model: 0.991 (p-value = 0.062)

• For similarity = 0.20 and bias = 0.95 (estimated on the devtest)


Some statistics

Preposition V N V(%)
TOTAL 9936 10865 47.77%
of 50 5527 0.90%
in 1948 1552 55.66%
to 2172 501 81.26%
for 1136 1045 52.09%
on 666 549 54.81%
from 644 292 68.80%
with 605 329 64.78%

20801 training, 3097 test, 66 different prepositions


Baselines and related work

• Always N: 59.0%
• Based on preposition only: 72.2%
• TBL [Brill & Resnik 94]: 81.8%
• Backoff [Collins & Brooks 95]: 84.5%
• Boosting [Abney & al. 99]: 84.6%
• Dependency-based nearest neighbors [Zhao & Lin 04]: 86.5%
• 3-hop random walk using WordNet and external noisy corpus [Toutanova & al. 04]: 87.5%
• Human (4 words only): 88.2%
• Human (whole sent): 93.2%


Current results

• PP attachment (full; 4039 test data points)

Number of labeled examples Backoff TUMBL

2000 0.797 0.801

10000 0.824 0.816

20801 0.843 0.842


Models of the Web

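$P(k) = e^{-\langle k \rangle} \frac{\langle k \rangle^k}{k!}$, with $\langle k \rangle = pN$

$P(k) \sim k^{-\gamma}$

[Figure: panels labeled A, B, a, b]

• Erdős/Rényi 59, 60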

• Barabási/Albert 99

• Watts/Strogatz 98

• Kleinberg 98

• Menczer 02

• Radev 03

• Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology