cikm’10 advisor ： jia ling, koh speaker ： sheng hong, chung

Using The Past To Score The Present: Extending Term Weighting Models with

Revision History Analysis

CIKM’10Advisor ： Jia Ling, KohSpeaker ： SHENG HONG, CHUNG

Outline

• Introduction• Revision History Analysis– Global Revision History Analysis– Edit History Burst Detection– Revision History Burst Analysis

• Incorporating RHA in retrieval models• System implementation• Experiment• Conclusion

Introduction

• Many researches will use modern IR models– Term weighting becomes central part of these

models– Frequency-based

• These models only examine one(final) version of the document to be retrieved, ignoring the actual document generation process.

IR model

document

original

after many revisiondocument

latest

Term frequency

True term frequency

Introduction

• New term weighting model– Use the revision history of the document– Redefine term frequency– In order to obtain a better characterization of

term’s true importance in a document

• Global revision history analysis– Simplest RHA model– document grows steadily over time– a term is relatively important if it appears in the

early revisions.

d ： document d form a versioned corpus DV = { v1,v2,….,vn } ： revision history of dc(t,d) ： frequency of term t in d ： decay factor

𝑇𝐹 𝑔𝑙𝑜𝑏𝑎𝑙 (𝑡 ,𝑑)=∑𝑗=1

𝑛 𝑐 (𝑡 ,𝑣 𝑗)

𝑗𝛼Frequency of term

in revision

Decay factor

Revision History Analysisd : { a,b,c } tf(a=3 b=2 c=1)

V = {v1,v2,v3}v1 = {a,b,c} tf(a=4 b=3 c=3)v2 = {a,b,c} tf(a=5 b=2 c=1)v3 = {a,b,c,e} tf(a=5 b=3 c=2 e=2)

TFglobal(a,d) = 4/1+5/2+5/3 = 4/1+5/2.14355+5/3.34837 = 4+2.333+1.493 = 7.826

TFglobal(e,d) = 0/1+0/2+2/3 = 0.597

1st revision:

500th revision:

Current revision:

TimeTerm Frequency

Document Length“Pandora” “James Cameron”

Nov. 2009 9 23 2576Dec. 2009 25 50 6306

Month (2009) Jul. Aug. Sep. Oct Nov. Dec.Edit Activity 89 224 67 154 232 1892

First photo & trailer released Movie released

Burst of Document (Length) & Change of Term Frequency

Burst of Edit Activity & Associated Events

Global Model might be insufficient

Edit History Burst Detection

• Content-based• Relative content change potential burst

: content length for j-th revision

Edit History Burst Detection

• Activity-based• Intensive edit activity potential bursts

Average revision counts

Deviation

ℬ𝑢𝑟𝑠𝑡❑ (𝑣 𝑗 )={1 , 𝑖𝑓 𝐵𝑢𝑟𝑠𝑡𝑐 (𝑣 𝑗 )+𝐵𝑢𝑟𝑠𝑡𝑎 (𝑣 𝑗 )>00 , h𝑜𝑡 𝑒𝑟𝑤𝑖𝑠𝑒

Revision History Burst Analysis

• A burst resets the decay clock for a term.• The weight will decrease after a burst.

𝑇𝐹 𝑏𝑢𝑟𝑠𝑡 (𝑡 ,𝑑 )=∑𝑗=1

∑𝑘=𝑏 𝑗

𝑛 𝑐 (𝑡 ,𝑣𝑘)(𝑘−𝑏 𝑗+1)

Frequency of term in revision

Decay factor for jth Burst

B = {b1,b2,….bm} : the set of burst indicators for document dbj : the value of bj is the revision index of the end of the j-th burst of document d

W : decay matrixi : a potential burst positionj : a document revision

U = [u1,u2…un] : the burst indicator that will be used to filter the decay matrix W to contain only the true bursts

d : { a,b,c } tf(a=3 b=2 c=1)V = {v1,v2,v3,v4}B = {b1,b2,b3,b4} = {1,0,1,0}V1 = {a,b,c,d} tf(a=50 b=20 c=30 d=10)V2 = {a,b,c,d} tf(a=52 b=21 c=33 d=10)V3 = {a,b,c,d} tf(a=70 b=35 c=40 d=20)V4 = {a,b,c,d} tf(a=73 b=33 c=48 d=21)

Incorporating RHA in retrieval models

𝑆 (𝑄 ,𝐷 )=∑𝑡 𝜖𝑄

𝐼𝐷𝐹 (𝑡 ) ∙𝑇𝐹 (𝑡 ,𝐷 ) ∙ (𝑘1+1 )

𝑇𝐹 (𝑡 ,𝐷 )+𝑘1(1−𝑏+𝑏∙ |𝐷|𝑎𝑣𝑔𝑑𝑙 )

𝑆 (𝑄 ,𝐷 )=𝐷 ¿Statistical Language Models

𝑇𝐹 h𝑟 𝑎 (𝑡 ,𝐷 )

𝑃 h𝑟 𝑎 (𝑡 ,𝐷 )

𝑇𝐹 h𝑟 𝑎 (𝑡 ,𝐷 )=𝜆1 ∙𝑇𝐹𝑔 (𝑡 ,𝐷 )+𝜆2 ∙𝑇𝐹 𝑏 (𝑡 ,𝐷 )+𝜆3 ∙𝑇𝐹 (𝑡 ,𝐷 )RHA Term Frequency:

ndicate the weights of RHA global model, burst model and original term frequency (probability).

𝜆1+𝜆2+𝜆3=1RHA Term Probability:

𝑃 h𝑟 𝑎 (𝑡 ,𝐷 )=𝜆1 ∙𝑃𝑔 (𝑡 ,𝐷 )+𝜆2 ∙𝑃𝑏 (𝑡 ,𝐷 )+𝜆3 ∙𝑃 (𝑡 ,𝐷 )

System implementation

The date of creating/editing.Content change

Evaluate metrics

• Queries and Labels:– INEX: provided– TREC: subset of ad-hoc track

• Metrics: – Bpref (robust to missing judgments)– MAP: mean average precision– R-prec: precision at position R– NDCG: normalized discounted cumulative gain

DatasetINEX: well established forum for structured retrieval tasks (based on Wikipedia collection)TREC: performance comparison on different set of queries and general applicability

INEX 64 topic

cikm’10 advisor ： jia ling, koh speaker ： sheng hong, chung

jth burst b

current revision

revision index

set of burst indicators

potential burst positionj

rha term probability

term weighting models

timea term

Documents

1 context-aware search personalization with concept...

date ： 2014/12/04 author ： parikshit sondhi, chengxiang...

on sparsity and drift for effective real- time filtering in...

cikm 2014 v2

cikm 2011 social computing industry invited talk

dynamic multi-faceted topic discovery in twitter date :...

date : 2014/04/01 author : zhung-xun liao, yi-chin pan,...

measuring innovation around the world revised · measuring...

date: 2011/12/26 source: dustin lange et. al (cikm’11)...

cikm keynote nov2014

seeking statement-supporting top-k witnesses date:...

incident threading for news passages (cikm 09) speaker:...

1 blog site search using resource selection 2008 acm cikm...

cikm’09 date:2010/8/24 advisor: dr. koh, jia-ling speaker:...

date: 2013/6/10 author: shiwen cheng, arash termehchy,...

tag clouds revisited date : 2011/12/12 source : cikm’11...

date: 2014/02/25 author: aliaksei severyn, massimo nicosia,...

1 statistical source expansion for question answering...

energy @ group 5 f 01 tiffany koh ee hui jie tay kang sheng...

koh tral and koh krachakses