A Few Examples Go A Long Way
Krisztian Balog, Wouter Weerkamp, Maarten de Rijke
Constructing Query Models from Elaborate Query Formulations
ISLA, University of Amsterdam
http://ilps.science.uva.nl
31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Singapore, July 20-24, 2008
Motivation

• Task: create an overview page for a given topic, i.e., find documents that discuss the topic in detail

[Slide graphic: the query "cancer risk" plus example documents, captioned "User's input to the search engine"]
How realistic is this?!
• Enterprise setting
• Users are willing to provide their information need in a more elaborate form:
  • a few keywords
  • a few sample documents
• Sample documents can also be obtained from click-through data along the way
Research Questions
• Can we make use of these sample documents in an effective and theoretically transparent manner?
• What is the effect of lifting the conditional dependence between the original query and expansion terms?
• Can we improve “aspect recall”?
Outline
• TREC Enterprise Track 2007
• Query model from sample documents
• Comparison with relevance models
• Results
• Conclusions and further work
TREC Enterprise Track 2007
• Document collection: web crawl of CSIRO (~370,000 docs, 4.2 GB)
• 50 topics
• Topic description is enriched with sample documents (on average 3 examples/topic)
• Relevance judgments on a 3-point scale (not relevant, possibly relevant, highly relevant)
Example Topic

<top>
  <num>CE-012</num>
  <query>cancer risk</query>
  <narr>Focus on genome damage and therefore cancer risk in humans.</narr>
  <page>CSIRO145-10349105</page>
  <page>CSIRO140-15970492</page>
  <page>CSIRO139-07037024</page>
  <page>CSIRO138-00801380</page>
</top>
Outline
• TREC Enterprise Track 2007
• Query model from sample documents
• Comparison with relevance models
• Results
• Conclusions and further work
Retrieval Model
• Standard Language Modeling
• Ranking documents by their likelihood of being relevant given the query Q:
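The ranking formula itself did not survive the transcript; what follows is the standard query-likelihood form this slide refers to, with n(t, Q) the frequency of term t in Q and θ_D the document language model:

$$p(D|Q) \propto p(D)\, p(Q|D) = p(D) \prod_{t \in Q} p(t|\theta_D)^{n(t,Q)}$$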
Retrieval Model (2)

• Assuming uniform document priors, it provides the same ranking as minimizing the KL-divergence between the query model θ_Q and the document model θ_D:

$$\mathrm{Score}(D, Q) \propto -KL(\theta_Q \,\|\, \theta_D) \stackrel{rank}{=} \sum_{t} p(t|\theta_Q) \log p(t|\theta_D)$$
Query Modeling

• Baseline QM: assigns probability mass uniformly across query terms
• Potential issues
  • Not all query terms are equally important
  • The query model is extremely sparse
• Solution: query expansion

[Diagram: original query → expanded query]
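How the original and expanded queries are combined is not spelled out in the transcript; a hedged reconstruction, using a mixing parameter λ and the expanded query Q̂:

$$p(t|\theta_Q) = \lambda\, p(t|Q) + (1 - \lambda)\, p(t|\hat{Q})$$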
A Query Model from Sample Documents

[Diagram: sample documents → sampling distribution → top K terms → expanded query]
Importance of a Sample Document

1. Uniform
  • All sample documents are equally important
2. Query-biased
  • A sample document's importance is proportional to its relevance to the query
3. Inverse query-biased
  • We reward documents that bring in new aspects

(A sketch of these three options follows below.)
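A hedged formalization, assuming S is the set of sample documents; the exact normalizations, in particular for the inverse variant, are in the paper, so treat these as illustrative:

$$\text{uniform: } p(D|S) = \frac{1}{|S|} \qquad \text{query-biased: } p(D|S) \propto p(Q|D) \qquad \text{inverse: } p(D|S) \propto 1 - p(Q|D)$$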
A Query Model from Sample Documents

[Diagram repeated: sample documents → sampling distribution → top K terms → expanded query]
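As a concrete illustration, a minimal Python sketch of this pipeline; the dictionaries standing in for p(t|D) and p(D|S), and all names, are illustrative rather than from the paper:

```python
from collections import defaultdict

def expand_query(query_terms, sample_docs, doc_weights, k=10, lam=0.5):
    """Sketch: construct an expanded query model from sample documents.

    query_terms: list of original query terms
    sample_docs: {doc_id: {term: p(t|D)}}   per-document term importance
    doc_weights: {doc_id: p(D|S)}           document importance, sums to 1
    """
    # Sampling distribution over terms: p(t|S) = sum_D p(t|D) * p(D|S)
    sampling = defaultdict(float)
    for doc_id, term_dist in sample_docs.items():
        for term, p_t_d in term_dist.items():
            sampling[term] += p_t_d * doc_weights[doc_id]

    # Keep the top-K terms and renormalize; this plays the role of p(t|Q_hat)
    top_k = sorted(sampling.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top_k)
    q_hat = {t: p / total for t, p in top_k}

    # Interpolate with the uniform original query model:
    # p(t|theta_Q) = lam * p(t|Q) + (1 - lam) * p(t|Q_hat)
    p_q = {t: query_terms.count(t) / len(query_terms) for t in set(query_terms)}
    theta_q = defaultdict(float)
    for t, p in p_q.items():
        theta_q[t] += lam * p
    for t, p in q_hat.items():
        theta_q[t] += (1 - lam) * p
    return dict(theta_q)
```

Plugging the resulting θ_Q into the KL-divergence ranking above completes the pipeline.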
Estimating Term Importance
1. Maximum likelihood estimate
2. Smoothed estimate
3. Ranking function by Ponte (2000)
J. Ponte, “Language models for relevance feedback”, in Advances in Information Retrieval, ed. W.B. Croft, 73-96, 2000.
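The three formulas were images on the slide and are not in the transcript; these are the standard forms, offered as hedged reconstructions (λ a smoothing parameter, C the collection):

1. Maximum likelihood: $p_{ml}(t|D) = \frac{n(t,D)}{|D|}$
2. Smoothed: $p_{sm}(t|D) = (1-\lambda)\, p_{ml}(t|D) + \lambda\, p(t|C)$
3. Ponte-style selection score: $s(t) = \sum_{D \in S} \log \frac{p(t|D)}{p(t|C)}$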
Research Questions
• Can we make use of these sample documents in an effective and theoretically transparent manner?
• What is the effect of lifting the conditional dependence between the original query and expansion terms?
• Can we improve “aspect recall”?
Results: Term Importance

Method                      MAP      MRR
Baseline (no expansion)     0.3576   0.7134
(ML) Maximum Likelihood     0.4449   0.8533
(SM) Smoothed               0.4406   0.8771
(EXP) Ponte Q.Exp.          0.4016   0.8148
• Improvements over the baseline reach 24% in MAP and 23% in MRR
★ Results reported on relevance level 1 (“possibly relevant”); see Table 3 in the paper for results on both relevance levels. P(D|S) is uniform.
Results: Document Importance

• Biasing sampling on the original query hurts MAP, but improves early precision
• Estimating P(t|D) with SM and EXP displays similar behavior

P(t|D)   P(D|S)                  MAP      MRR
ML       Uniform                 0.4449   0.8533
ML       Query-biased            0.4294   0.8810
ML       Inverse query-biased    0.4184   0.8268
★ Results reported on relevance level 1 (“possibly relevant”); see Table 4 in the paper for results on both relevance levels.
Outline
• TREC Enterprise Track 2007
• Query model from sample documents
• Comparison with relevance models
• Results
• Conclusions and further work
Comparison with Relevance Models
• Lavrenko and Croft (SIGIR 2001)
• Estimate using the joint probability of observing t together with the query terms in feedback documents
• Blind relevance feedback
• Use sample documents as feedback docs
• Two methods with different independence assumptions (RM1, RM2)
Results: Comparison with Relevance Models

Method                                 MAP      MRR
Baseline (no expansion)                0.3576   0.7134
Relevance models (RM2)
  - Blind relevance feedback           0.3677   0.6703
  - Sample documents                   0.4273   0.9029
Query model from sample documents
  (ML) Maximum Likelihood              0.4449   0.8533
  (SM) Smoothed                        0.4406   0.8771
  (EXP) Ponte Q.Exp.                   0.4016   0.8148
★ Results reported on relevance level 1 (“possibly relevant”); see Table 3 in the paper for results on both relevance levels. P(D|S) is uniform.
Results: "Aspect recall"

• Research questions
  • Do sample documents provide aspects not covered by the original query?
  • Does avoiding biasing term selection toward the original query help to identify these additional aspects?

Relevance   Baseline   RM (blind fb.)   RM (sample d.)   QM from sample d.
possibly    5,445      5,582            5,882            6,052
highly      2,763      2,816            2,929            3,047
Results: Topic-level comparison

[Bar chart: Average Precision per method; legend: Baseline, RM2 (blind fb.), RM2 (sample docs), QM (sample docs)]

Topic #36, Average Precision:

Baseline            0.5681
RM2, blind fb.      0.7971
RM2, sample docs    0.1205
QM, sample docs     0.2342
<top>
  <num>CE-036</num>
  <query>termites</query>
  <narr>Resources describing termites or ‘white ants’ as well as food identification through vibrations will all contain useful information. Current CSIRO research in termite pest management looks at deterring termites through non-chemical means using the vibrations of wood (termite food) to manipulate their feeding habits.</narr>
</top>
Top-10 expansion terms (three term lists shown on the slide):

• termites, csiro, wood, food, termite, vibrations, blocks, species, australian, made
• termites, site, information, legal, notice, disclaimer, privacy, web, subject, drywood
• termites, site, information, legal, notice, privacy, disclaimer, drywood, statement, subject
Results: Topic-level comparison

[Bar chart: Average Precision per method for topic #35; legend: Baseline, RM2 (blind fb.), RM2 (sample docs), QM (sample docs)]

<top>
  <num>CE-035</num>
  <query>nanohouse</query>
  <narr>CSIRO have developed a model house that shows how new materials, products and processes that are emerging from nanotechnology research and development might be applied to our living environment. ... Resources describing molecular and nanoscale components, industrial physics, biomimetics, nanoparticle films, biosensors and molecular electronics would all be relevant to this topic.</narr>
</top>

Topic #35, Average Precision:

Baseline            0.0451
RM2, blind fb.      0.1290
RM2, sample docs    0.1457
QM, sample docs     0.3810
Top-10 expansion terms (three term lists shown on the slide):

• nanohouse, nanotechnology, csiro, cameron, dr, research, technology, gene, control, fiona
• nanohouse, nanotechnology, csiro, technology, research, conference, australia, molecules, chemistry, information
• nanohouse, physics, csiro, nanoscale, nanotechnology, materials, devices, structures, molecular, building
Wrap up
• Method for sampling expansion terms in a query-independent way
• Various expansions based on term and document importance weighting
• Outperforms a high-performing baseline as well as query-dependent expansion methods
• Helps to address the “aspect recall” problem
Further Work
• Other ways of exploiting sample documents
• Layout, link structure, document structure, etc.
• Combining terms extracted from blind feedback documents with terms from sample documents
Further work (2)
• Use expanded query models for expert finding [Balog and de Rijke, CIKM 2008]
A Few Examples Go A Long Way
K.Balog@uva.nl
http://www.science.uva.nl/~kbalog
Relevance Models (Lavrenko and Croft, SIGIR 2001)

• Estimate the probability of a term given the query from the joint distribution:

$$p(t|\hat{Q}) \propto \frac{p(t, q_1, \ldots, q_n)}{\sum_{t'} p(t', q_1, \ldots, q_n)}$$

• RM1 (i.i.d. sampling: t and the query terms are independent given a document):

$$p(t, q_1, \ldots, q_k) = \sum_{d \in M} p(d)\, p(t|d) \prod_{i=1}^{k} p(q_i|d)$$

• RM2 (conditional sampling: the query terms are conditioned on t, with a pairwise independence assumption):

$$p(t, q_1, \ldots, q_k) = p(t) \prod_{i=1}^{k} \sum_{d \in M} p(d|t)\, p(q_i|d)$$