an exploration of retrieval enhancing methods for integrated search in a digital library ecir2012,...
An Exploration of Retrieval-Enhancing Methods for Integrated Search in a Digital Library
TBAS2012, Barcelona. April 1, 2012
Diana Ransgaard Sørensen, Toine Bogers, Birger Larsen
Royal School of Library and Information Science, Copenhagen, Denmark
02/04/2012 1
Outline
• Introduction
– Problem
– Our focus
– Goal
• Methodology
• Experiments
• Conclusion
Introduction
• Problem
– Different document types contain different amounts of text (full text vs. metadata-only)
– Some document types are more likely to be retrieved than others, regardless of relevance
• Our focus
– How to best combine & rank different document types and representations in a digital library setting?
• Goal
– Present the user with a single ranked list containing the optimal mix of document types
– Explore different techniques for integrating different document types and representations into a single results list
Outline
• Introduction
• Methodology
– Test collection
– Topics
– Experimental setup
• Experiments
• Conclusion
Test collection
• iSearch collection
– Based on the digital physics library arXiv.org
– Available from http://itlab.dbit.dk/~isearch
– Three different document types
• 18,443 metadata-only book records (BK)
• 291,246 metadata-only article records + abstracts (PN)
• 143,571 full-text article records, including metadata (PF)
– Topics
• 65 topics with graded relevance assessments
• Created by 23 lecturers and experienced postgraduate and graduate students from three different university departments of physics
Topics
• Each topic representation contains five fields
– Description of information sought
– User background knowledge
– Work task description
– Ideal answer
– Keywords
• “What are the key search terms used to express your situation and your information needs?”
Experimental setup
• Indexing & retrieval
– Indri 5.0 toolkit
• Stop word filtering
• Stemming
– Language modeling algorithms with three different smoothing methods
• Jelinek-Mercer smoothing (JM)
• Bayesian smoothing using Dirichlet priors (DIR)
• Two-stage smoothing (TWO)
• Evaluation
– Normalized Discounted Cumulated Gain (NDCG)
– Two-tailed paired Student's t-test
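The evaluation step can be illustrated with a minimal sketch. It assumes the common log2(rank + 1) discount with graded relevance values used directly as gains; the slides do not state which NDCG variant or evaluation tool was used. The function names `dcg`, `ndcg` and `paired_t_statistic` are illustrative, and only the t statistic is computed (not the p-value).

```python
import math
import statistics


def dcg(gains):
    """Discounted cumulated gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, all_gains):
    """NDCG of a ranking: its DCG divided by the DCG of the ideal ordering.

    ranked_gains: graded relevance of the retrieved documents, in rank order.
    all_gains: graded relevance of every judged document for the topic.
    """
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test over per-topic scores of two runs."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

In use, `ndcg` would be computed per topic for two runs, and the two lists of per-topic scores passed to `paired_t_statistic`.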
Outline
• Introduction
• Methodology
• Experiments
1) Out of the box
2) Weighting
3) Fusion
• Conclusion
Experiments
1) Default settings + optimized baseline runs
2) Adjust weighting of the three document types
3) Fusing different document types
1) Out-of-the-box vs. optimized
• We optimize the settings of the system and of the retrieval model on a combined index of all three document types (BK, PF and PN)
– Using the default, out-of-the-box settings does not always provide the best retrieval performance
– Default parameter settings can be seen as a generalization over many different test collections
• Goal is to examine how much performance can improve over default settings in this integrated search scenario
1) Out-of-the-box vs. optimized
• What do we compare?
– Out-of-the-box
– Tuned
• What do we optimize?
(i) Stop word filtering: Yes or no
(ii) Krovetz stemming: Yes or no
(iii) LM smoothing parameters:
λ [0-1] in steps of 0.1
μ [0-5000] in steps of 500
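To make the tuned parameters concrete, here is a minimal sketch of Jelinek-Mercer and Dirichlet smoothing together with the λ and μ grids from the slide. It assumes the common convention in which λ is the weight on the collection model; how Indri parameterizes these methods internally is not shown in the slides and may differ. The function names are illustrative, and two-stage smoothing is left out.

```python
def jelinek_mercer(tf, doc_len, p_collection, lam):
    """P(w|d) = (1 - lam) * tf/|d| + lam * P(w|C); lam weights the collection model."""
    p_doc = tf / doc_len if doc_len > 0 else 0.0
    return (1.0 - lam) * p_doc + lam * p_collection


def dirichlet(tf, doc_len, p_collection, mu):
    """P(w|d) = (tf + mu * P(w|C)) / (|d| + mu)."""
    denom = doc_len + mu
    if denom == 0:
        return 0.0  # empty document and no prior mass
    return (tf + mu * p_collection) / denom


# Parameter grids from the slide: lambda in [0, 1] step 0.1, mu in [0, 5000] step 500.
LAMBDAS = [round(0.1 * i, 1) for i in range(11)]
MUS = list(range(0, 5001, 500))
```

Each grid has 11 values, so a sweep of this kind evaluates 11 settings per smoothing parameter for every combination of the stop word and stemming options.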
1) Out-of-the-box vs. optimized
[Figure: NDCG of out-of-the-box vs. optimized runs. Optimized default: NDCG 0.3263; default: NDCG 0.2856. A marker indicates statistical significance.]
1) Out-of-the-box vs. optimized
Optimizing the baseline runs increases the NDCG scores by:
17.4% (JM)
9.8% (DIR)
17.2% (TWO)
The best-performing model is JM, with an NDCG score of 0.3263.
This run serves as the baseline in the remaining tests (weighting and fusion).
2) Weighting document types
• Weights: each document type takes a value from [0.0001, 0.2, 0.4, 0.6, 0.8, 1.0]
• 6 × 6 × 6 = 216 unique weight combinations for the three document types
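The 216 combinations follow from assigning each of the three document types one of the six weights. A minimal sketch of enumerating that grid is shown below; the names `WEIGHTS`, `DOC_TYPES` and `weight_grid` are illustrative, and how the weights are applied inside the retrieval query is not shown in the slides.

```python
from itertools import product

# Six candidate weights and three document types, as listed in the slides.
WEIGHTS = [0.0001, 0.2, 0.4, 0.6, 0.8, 1.0]
DOC_TYPES = ("BK", "PN", "PF")

# One dict per combination, e.g. {"BK": 0.0001, "PN": 0.2, "PF": 1.0}.
weight_grid = [
    dict(zip(DOC_TYPES, combo))
    for combo in product(WEIGHTS, repeat=len(DOC_TYPES))
]
```

Each entry in `weight_grid` would correspond to one retrieval run, evaluated with NDCG and compared against the optimized baseline.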
2) Weighting: top 10 of 216
[Figure: top 10 of the 216 weighted runs. Legend: book records, metadata, full text. Reference: optimized default, NDCG 0.3263.]
3) Fusing document types
• Three separate indexes, optimized runs in each
• Two types of fusion
– Round-robin merging
– Linear combination (LC) with score- or rank-normalization
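The two fusion types can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes min-max scaling as the score normalization (the slides do not say which normalization was used), it omits the rank-normalized variant, and the function names and per-run weights are hypothetical.

```python
from itertools import zip_longest


def round_robin(*runs):
    """Interleave ranked lists of doc ids, one from each run in turn; skip duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*runs):
        for doc_id in tier:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def min_max(run):
    """Score normalization: rescale a {doc_id: score} run to [0, 1]."""
    if not run:
        return {}
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {d: 1.0 for d in run}
    return {d: (s - lo) / (hi - lo) for d, s in run.items()}


def linear_combination(runs, weights):
    """Weighted sum of min-max normalized scores; returns doc ids, best first."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc_id, score in min_max(run).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * score
    return sorted(fused, key=fused.get, reverse=True)
```

With one run per document-type index, `round_robin` ignores scores entirely, while `linear_combination` lets the per-run weights control how strongly each document type contributes to the final list.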
3) Fusing document types
[Figure: fusion results compared with the one-index baseline. Fusion: NDCG 0.3286 (+0.7%).]
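As a check on the reported gain, the relative improvement can be recomputed from the fusion NDCG of 0.3286 and the optimized one-index baseline of 0.3263 given in the earlier slides:

```latex
\[
\frac{0.3286 - 0.3263}{0.3263} = \frac{0.0023}{0.3263} \approx 0.0070 = 0.7\%
\]
```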
Outline
• Introduction
• Methodology
• Experiments
• Conclusion
– Discussion
– Future work
Conclusions
• As expected, optimizing the retrieval model produces beneficial results on a combined index of document types.
• Our approach for weighting document types is not an effective way of improving integrated search performance.
• Round-robin merging is not an effective strategy for integrating different document types.
• Fusion based on Linear Combination on individual indexes for the document types produces results that are slightly better than the baseline.
Discussion
We expected that weighting the document types in our combined index differently could boost performance even further, but this was not the case.
Trend of the best weighted runs: book records tended to have higher weights, and article metadata and full text lower weights.
Future work
• A more extensive analysis of the performance of the individual document types. Goal: more fruitful techniques for weighting them properly.
• Weighting: calculate document-specific weights based on analysis of different document features, instead of only assigning a weight based on the document type.
• Use the citation information from the documents available in the iSearch collection as an additional source of information.