Post on 26-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

İrem Arıkan, Srikanta Bedathur, Klaus Berberich

Time Will Tell:

Leveraging Temporal Expressions in IR

Motivation

Documents contain temporal information in the form of temporal expressions

Users have temporal information needs

Query: Prime Minister United Kingdom 2000

Problem: Traditional information retrieval systems do not exploit the temporal content in documents

Temporal expressions are more than common terms

Our approach: Integrates the temporal dimension into a language-model-based retrieval framework

Outline

Motivation

Model

Our Approach

Experimental Evaluation

Document Model

Document d = { d_text, d_temp }

d_text: a bag of textual terms

d_temp: a bag of temporal expressions

A temporal expression is treated as a time interval T = [begin, end]

[Figure: time axis starting at 0 with the interval T drawn between begin and end]

Query Model

Query q = { q_text, q_temp }

q_text: set of textual terms

q_temp: set of temporal expressions

Example: in the query "Prime Minister United Kingdom 2000", q_text is "Prime Minister United Kingdom" and q_temp is "2000"
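The document and query models above can be written down as plain data structures. The following is a minimal Python sketch, not the authors' implementation (which the slides describe as Java + MySQL). The class names `Interval`, `Document`, and `Query` are hypothetical, and mapping the expression "2000" to the interval from 1 January to 31 December 2000 is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Interval:
    """A temporal expression as a closed time interval [begin, end]."""
    begin: date
    end: date

    def __post_init__(self):
        if self.begin > self.end:
            raise ValueError("begin must not be after end")


@dataclass
class Document:
    text: list[str] = field(default_factory=list)       # d_text: bag of textual terms
    temp: list[Interval] = field(default_factory=list)  # d_temp: bag of temporal expressions


@dataclass
class Query:
    text: set[str] = field(default_factory=set)        # q_text: set of textual terms
    temp: set[Interval] = field(default_factory=set)   # q_temp: set of temporal expressions


# Assumption: the year "2000" covers 1 January to 31 December 2000.
year_2000 = Interval(date(2000, 1, 1), date(2000, 12, 31))
query = Query(text={"prime", "minister", "united", "kingdom"}, temp={year_2000})
```

Bags are modelled as lists so that repeated terms and expressions are kept, while the query side uses sets, matching the wording on the slides.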

Outline

Motivation

Model

Our Approach

Filtering Approach

Weighted Approach

Experimental Evaluation

Our Baseline: Ponte and Croft's Model (LM)

Each document has an associated language model

The query is treated as the outcome of a random process

Documents are ranked according to the likelihood that the query would be generated by the language model estimated for each document

$P(q_{text} \mid d_{text}) \sim \prod_{w \in q_{text}} P(w \mid d_{text})$
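To make the ranking principle concrete, here is a small Python sketch of query-likelihood scoring. It is only an illustration: it uses linear interpolation with collection statistics, which is swapped in here for simplicity and is not Ponte and Croft's estimator. The function name `query_log_likelihood`, the weight `lam`, and the two toy documents are assumptions.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_tf, collection_len, lam=0.5):
    """log P(q_text | d_text) under a unigram model with linear interpolation.

    The smoothing scheme and lam are illustrative assumptions.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for w in query_terms:
        p_doc = doc_tf[w] / doc_len if doc_len else 0.0
        p_coll = collection_tf.get(w, 0) / collection_len
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term does not occur anywhere in the collection
        log_p += math.log(p)
    return log_p


# Toy collection, made up for this example.
docs = {
    "d1": "tony blair prime minister united kingdom".split(),
    "d2": "united kingdom rivers and lakes".split(),
}
collection_tf = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_tf.values())
query = ["prime", "minister", "united", "kingdom"]

ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection_tf, collection_len),
    reverse=True,
)
```

Working in log space avoids numerical underflow when many query-term probabilities are multiplied.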

Filtering Approach (LMF)

Idea: Discard all documents that do not contain any temporal expression relevant to the user's query

Our definition of temporal relevance: a temporal expression is relevant only if it overlaps with a temporal expression from the query

[Figure: timeline t with begin and end marks, for the query 2000]

28 Nov 1990 - 2 May 1997: irrelevant (no overlap with 2000)

2 May 1997 - 27 June 2007: relevant (overlaps with 2000)
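The overlap test behind the filtering idea can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names `overlaps` and `temporal_filter` and the document ids are hypothetical, intervals are treated as closed, and the year 2000 is assumed to span 1 January to 31 December.

```python
from datetime import date


def overlaps(t, q):
    """True if the closed intervals t = (begin, end) and q = (begin, end) share a day."""
    return t[0] <= q[1] and q[0] <= t[1]


def temporal_filter(docs, query_intervals):
    """Keep ids of documents with at least one expression overlapping a query expression."""
    return [
        doc_id
        for doc_id, intervals in docs.items()
        if any(overlaps(t, q) for t in intervals for q in query_intervals)
    ]


# Assumption: the query expression "2000" covers the whole calendar year.
query_2000 = [(date(2000, 1, 1), date(2000, 12, 31))]
docs = {
    "doc_a": [(date(1990, 11, 28), date(1997, 5, 2))],
    "doc_b": [(date(1997, 5, 2), date(2007, 6, 27))],
}
kept = temporal_filter(docs, query_2000)
```

Only `doc_b` survives the filter here; the surviving documents would presumably then be ranked by the baseline model.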

Problem: the filtering approach has a black-and-white view of the world

It does not take into account

how many relevant temporal expressions a document contains

how closely they match the temporal expressions specified in the user's query

Example: for the query 1980 - 1990, the expression 1980 - 1989 is more relevant than 23 March 1984

Weighted Approach (LMW)

Idea: Assign higher relevance to a document if it contains more temporal expressions that more closely match the temporal expressions from the user's query

We assume that q_text and q_temp are produced independently

$P(q \mid d) = P(q_{text} \mid d_{text}) \cdot P(q_{temp} \mid d_{temp})$

Temporal expressions occur independently

$P(q_{temp} \mid d_{temp}) = \prod_{Q \in q_{temp}} P(Q \mid d_{temp})$

Weighted Approach

Each temporal expression T in d is a sample from a different generative model

Generating a temporal expression Q = [qBegin, qEnd] given d_temp:

1. draw a single temporal expression T = [dBegin, dEnd] uniformly at random from d_temp

2. generate Q by the generative model that is associated with T

The likelihood of generating Q by the set of generative models that produced d_temp:

$P(Q \mid d_{temp}) = \frac{1}{|d_{temp}|} \sum_{T \in d_{temp}} P(Q \mid T)$
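The two-step generation process amounts to a uniform mixture over the document's temporal expressions, combined by a product over the query's temporal expressions. A minimal Python sketch follows, with the per-expression model P(Q | T) passed in as a function. The names `p_q_given_dtemp`, `p_qtemp_given_dtemp`, and the placeholder model `exact_match` are hypothetical, and returning 0.0 for a document without temporal expressions is an assumption.

```python
def p_q_given_dtemp(q, d_temp, p_q_given_t):
    """P(Q | d_temp) = (1 / |d_temp|) * sum of P(Q | T) over all T in d_temp."""
    if not d_temp:
        return 0.0  # assumption: no temporal expressions means zero likelihood
    return sum(p_q_given_t(q, t) for t in d_temp) / len(d_temp)


def p_qtemp_given_dtemp(q_temp, d_temp, p_q_given_t):
    """P(q_temp | d_temp) as a product over the query's temporal expressions."""
    result = 1.0
    for q in q_temp:
        result *= p_q_given_dtemp(q, d_temp, p_q_given_t)
    return result


def exact_match(q, t):
    """Hypothetical placeholder for P(Q | T): 1.0 for identical intervals, else 0.0."""
    return 1.0 if q == t else 0.0


# d_temp is a bag, so the repeated expression counts twice.
d_temp = [(1980, 1989), (1984, 1984), (1980, 1989)]
single = p_q_given_dtemp((1980, 1989), d_temp, exact_match)
combined = p_qtemp_given_dtemp([(1980, 1989), (1984, 1984)], d_temp, exact_match)
```

Passing the model as a parameter keeps the mixture independent of whichever density is chosen for P(Q | T).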

Generate Q = [qBegin, qEnd] from the query by the generative model that is associated with T = [dBegin, dEnd] from a document

$P(Q \mid T) = P(qBegin) \cdot P(qEnd \mid qBegin)$

[Figure: two density plots. P(qBegin), with axis marks dBegin - α(dEnd - dBegin), dBegin, and dEnd; P(qEnd | qBegin), with axis marks qBegin, dEnd, and dEnd + α(dEnd - dBegin)]

The model produces only temporal expressions that are relevant to T

P(Q | T) gets smaller as the length of the overlap between Q and T decreases
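The transcript does not preserve the exact densities P(qBegin) and P(qEnd | qBegin), so the Python sketch below does not implement them. It only illustrates the two stated properties with an unnormalized stand-in score: zero for non-overlapping intervals, and smaller as the overlap shrinks. The function names, the year-level granularity, and the normalization by the longer interval are all assumptions.

```python
def overlap_length(q, t):
    """Number of years shared by the closed year intervals q and t (0 if disjoint)."""
    return max(0, min(q[1], t[1]) - max(q[0], t[0]) + 1)


def overlap_score(q, t):
    """Unnormalized stand-in for P(Q | T): overlap relative to the longer interval.

    This is an illustrative assumption, not the density used in the slides.
    """
    longer = max(q[1] - q[0] + 1, t[1] - t[0] + 1)
    return overlap_length(q, t) / longer


query = (1980, 1990)
close_match = overlap_score(query, (1980, 1989))  # long overlap with the query
point_match = overlap_score(query, (1984, 1984))  # a single year inside the query
no_match = overlap_score(query, (1991, 1995))     # disjoint from the query
```

Under this stand-in, the expression 1980 - 1989 scores higher than a single date in 1984 for the query 1980 - 1990, which mirrors the ordering motivated earlier in the deck.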

Outline

Motivation

Model

Our Approach

Experimental Evaluation

Experimental Evaluation

Dataset

HTML snapshot of the English Wikipedia from May 2007 containing ~2M documents

Implementation

Terrier Information Retrieval Platform: provides an implementation of Ponte & Croft's approach

LMF, LMW: Java + MySQL

A set of regular expressions for extracting temporal information
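The regular expressions themselves are not shown in the slides. As a rough idea of what such extraction could look like, here is a hedged Python sketch (the original implementation is described as Java) covering year ranges, decades, centuries, and single years. The patterns, the function name `extract_intervals`, and the mapping of "18th century" to 1700 - 1799 are assumptions for illustration.

```python
import re

YEAR_RANGE = re.compile(r"\b(\d{4})\s*[-\u2013]\s*(\d{4})\b")
DECADE = re.compile(r"\b(\d{3})0s\b")
CENTURY = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)\s+century\b", re.IGNORECASE)
YEAR = re.compile(r"\b(\d{4})\b")


def extract_intervals(text):
    """Return (begin_year, end_year) pairs found in text, at year granularity."""
    intervals = []
    for m in YEAR_RANGE.finditer(text):
        intervals.append((int(m.group(1)), int(m.group(2))))
    # Blank out ranges so their endpoints are not matched again as single years.
    text = YEAR_RANGE.sub(" ", text)
    for m in DECADE.finditer(text):
        start = int(m.group(1)) * 10
        intervals.append((start, start + 9))
    for m in CENTURY.finditer(text):
        n = int(m.group(1))
        intervals.append(((n - 1) * 100, n * 100 - 1))  # assumption: 18th -> 1700-1799
    for m in YEAR.finditer(text):
        intervals.append((int(m.group(1)), int(m.group(1))))
    return intervals
```

For example, `extract_intervals("Sea Battle 1650 - 1670")` yields one interval for the range, while a bare year such as "1976" becomes a one-year interval.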

Anecdotal query results - 1

Query: Spanish painter 18th century

Rank | LM | LMF | LMW
1 | Art in Puerto Rico | Jose del Castillo | Jose del Castillo
2 | Spanish Art | List of Spanish Artists | Roybal
3 | Palazzo Bianco (Genoa) | Roybal | Augustine Esteve
4 | Caprichos | Augustine Esteve | Maldonado
5 | Portrait Painting | Francisco Eduardo Tresguerras | Luis Egidio Melendez

Anecdotal query results - 2

Query: Sea Battle 1650 - 1670

Rank | LM | LMF | LMW
1 | Battle of Dunbar (1650) | List of Norwegian Battles | Battle of Gabbard
2 | Monte Mataiur | Battle of Portland | Battle of Portland
3 | St. George Caye | Action of 22 February 1812 | Battle of Scheveningen
4 | Culrain, Scotland | Naval Strategy | Battle of Kentish Knock
5 | First Anglo-Dutch War | Battle of Gabbard | Battle of Dungeness

User Study

20 queries

Pooling top-10 results returned by the three methods

Relevance assessment by 15 users

highly relevant: 2

marginally relevant: 1

irrelevant: 0

NDCG as a measure of effectiveness
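For readers unfamiliar with the measure, NDCG over the graded judgments (2, 1, 0) can be computed as in the Python sketch below. The slides do not say which NDCG variant was used or how the 15 users' assessments were aggregated, so the gain function (the grade itself), the log2 discount, the cutoff `k`, and the toy grades are assumptions.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, pool_gains, k=10):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering of the pool."""
    ideal_dcg = dcg(sorted(pool_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy grades for the pooled results of one query: 2 = highly relevant,
# 1 = marginally relevant, 0 = irrelevant.
pool = [2, 2, 1, 1, 0, 0]
perfect = ndcg([2, 2, 1, 1, 0], pool, k=5)
worse = ndcg([0, 1, 2, 0, 1], pool, k=5)
```

A ranking that lists the highly relevant results first reaches an NDCG of 1, and any ranking that pushes them down scores lower.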


Conclusion

Documents are rich in temporal expressions, but existing retrieval models ignore their inherent semantics

Our work proposes two methods that address this problem

Initial experimental evidence shows that our methods improve retrieval effectiveness for temporal information needs

Thank you!

Questions?

Queries

1 Mergers and Acquisitions <2001 - 2004>

2 United States Railway <1800 - 1900>

3 Folklore Music <1700 - 1799>

4 Earthquake <1980 - 1990>

5 Sea Battle <1650 - 1700>

6 United States Secretary of State <1950 - today>

7 Native Americans <1950 - today>

8 German Architecture <1919 - 1933>

9 Internet <1950 - 1995>

10 Olympic Games <1976>

11 Blues Music <1900 - 1930>

12 Personal Computer <1975 - 1985>

13 Clint Eastwood <1970 - 1979>

14 Black Death Spain <1600 - 1699>

15 Italian Fascism <1920 - 1950>

16 George Bush <1989 - 1992>

17 Flying Machine <1500 - 1799>

18 Spanish Painter <18th Century>

19 Economic Situation Germany <1920s>

20 Ford Motor Company <1900 - 1930>

Weighted Approach

Generative model associated with T = [b, e]

[Figure: two density plots. P(b'), with axis marks b - α(e - b), b, and e; P(e'), with axis marks e and e + α(e - b)]

The model generates only intervals that overlap with T

P(b', e') ~ |overlap|

Our Baseline: Ponte and Croft's Model (LM)

Query likelihood: the likelihood that a query q and a document d are generated by the same language model

It depends on the term frequency of the query words in the document and on their collection frequency

$P(q_{text} \mid d_{text}) = \prod_{w \in q_{text}} P(w \mid d_{text}) \times \prod_{w \notin q_{text}} (1.0 - P(w \mid d_{text}))$