CS798: Information Retrieval
Charlie Clarke [email protected]
Information retrieval is concerned with representing, searching, and manipulating large collections of human-language data.
Housekeeping
Web page: http://plg.uwaterloo.ca/~claclark/cs798
Area: “Applications/Databases”
Meeting times: Mondays, 2:00-5:00, MC2036
[Diagram: IR shown alongside its neighbouring areas: NLP, DB, and ML]
Topics
1. Basic techniques
2. Searching, browsing, ranking, retrieval
3. Indexing algorithms and data structures
4. Evaluation
5. Application areas
1. Basic Techniques
• Text representation & Tokenization
• Inverted indices
• Phrase searching example
• Vector space model
• Boolean retrieval
• Simple proximity ranking
• Test collections & Evaluation
2. Retrieval and Ranking
• Probabilistic retrieval and Okapi BM25F
• Language modeling
• Divergence from randomness
• Passage retrieval
• Classification
• Learning to rank
• Implicit user feedback
3. Indexing
• Algorithms and data structures
• Index creation
• Dynamic update
• Index compression
• Query processing
• Query optimization
4. Evaluation
• Statistical foundations of evaluation
• Measuring Efficiency
• Measuring Effectiveness
  – Recall/Precision
  – NDCG
  – Other measures
• Building a test collection
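The effectiveness measures named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document IDs and relevance grades are made up, and the NDCG shown is one common formulation (linear gain, log2 discount); other variants exist.

```python
import math

def precision_recall(ranking, relevant, k):
    """Precision and recall over the top-k documents of a ranking.

    ranking  -- list of document IDs, best first
    relevant -- set of document IDs judged relevant
    """
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def dcg(gains):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranking, graded, k):
    """DCG of the top k, normalized by the DCG of an ideal ordering.

    graded -- dict mapping document ID to a relevance grade (missing means 0)
    """
    gains = [graded.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(graded.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Precision and recall use binary judgments, while NDCG takes graded judgments and rewards placing highly relevant documents near the top.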
5. Application Areas
• Parallel retrieval architectures
• Web search (Link analysis/PageRank)
• XML retrieval
• Filesystem search
• Spam filtering
Other Topics (student projects)
• Image/video/speech retrieval
• Web spam
• Cross- and multi-lingual IR
• Clustering
• Advertising/Recommendation
• Distributed IR/Meta-search
• Question answering
• etc.
Resources
Textbook (partial draft on Website):
Büttcher, Clarke & Cormack. Information Retrieval: Data Structures, Algorithms and Evaluation.
(start reading ch. 1-3)
Wumpus:
www.wumpus-search.org
Grading
• Short homework exercises from text (10%)
• A literature review based on a topic area selected by the student with the agreement of the instructor (30%)
• 30-minute presentation on your selected topic (20%)
• Class project (40%) – details coming up.
“Documents”
• Documents are the basic units of retrieval in an IR system.
• In practice they might be: Web pages, email messages, LaTeX files, news articles, phone messages, etc.
• Update: add, delete, append(?), modify(?)
• Passages and XML elements are other possible units of retrieval.
Probability Ranking Principle
If an IR system’s response to a query is a ranking of the documents in the collection in order of decreasing probability of relevance, the overall effectiveness of the system to its users will be maximized.
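As a rough illustration of the principle, the sketch below ranks documents by decreasing probability of relevance. It assumes the probabilities are already known; the document IDs, the values, and both function names are hypothetical, and estimating the probabilities is the hard part that retrieval models address.

```python
def rank_by_relevance(prob_relevance):
    """Return document IDs ordered by decreasing probability of relevance."""
    return sorted(prob_relevance, key=lambda doc: prob_relevance[doc], reverse=True)

def expected_relevant_at_k(ranking, prob_relevance, k):
    """Expected number of relevant documents among the top k of a ranking."""
    return sum(prob_relevance[doc] for doc in ranking[:k])

# Hypothetical probabilities, for illustration only.
probs = {"a": 0.2, "b": 0.9, "c": 0.5}
ranking = rank_by_relevance(probs)
```

For any cutoff k, no other ordering of the same documents gives a larger expected number of relevant documents in the top k, which is one simple sense in which this ranking is optimal.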
Evaluating IR systems
• Efficiency vs. effectiveness
• Manual evaluation
  – Topic creation and judging
  – TREC (Text REtrieval Conference)
  – Google Has 10,000 Human Evaluators?
• Evaluation through implicit user feedback
• Specificity vs. exhaustivity
<topic> <title> shark attacks </title>
<desc>Where do shark attacks occur in the world?
</desc>
<narr>Are there beaches or other areas that are particularly prone to shark attacks? Documents comparing areas and providing statistics are relevant. Documents describing shark attacks at a single location are not relevant.
</narr></topic>
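A topic in this form can be read into its fields with a few lines of Python. The sketch below assumes the topic is well-formed XML, as the one on this slide happens to be; topic files in practice may not be, so a more tolerant parser could be needed. The function name `parse_topic` is hypothetical.

```python
import xml.etree.ElementTree as ET

TOPIC = """<topic> <title> shark attacks </title>
<desc>Where do shark attacks occur in the world?
</desc>
<narr>Are there beaches or other areas that are particularly prone to shark attacks? Documents comparing areas and providing statistics are relevant. Documents describing shark attacks at a single location are not relevant.
</narr></topic>"""

def parse_topic(text):
    """Map each field tag (title, desc, narr) to its whitespace-normalized text."""
    root = ET.fromstring(text)
    return {child.tag: " ".join((child.text or "").split()) for child in root}

fields = parse_topic(TOPIC)
```

The title would typically serve as the query, while the description and narrative tell the judges what counts as relevant.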
Class Project: Wikipedia Search
• Can we outperform Google on Wikipedia?
• Basic project: Build a search engine for Wikipedia (using any tools you can find).
• Ideas: PageRank, spelling, structure, element retrieval, summarization, external information, user interfaces
Class Project: Evaluation
• Each student will create and judge n topics.
• The value of n depends on the number of students. (But workload stays the same.)
• Quantitative measure of effectiveness.
• Qualitative assessment of user interfaces.
• Volunteer needed to operate the judging interface (for credit).
Class Project: Organization
• You may work in groups (check with me).
• You may work individually (check with me).
• You may create and share tools with other students. You get the credit. (e.g. Volunteer needed to set up a class wiki.)
• Programming can’t be avoided, but can be minimized. ☺
• Programming can also be maximized.
Class Project: Grading
• Topic creation and judging: 10%
• Other project work: 30%
  – You are responsible for submitting one experimental run for evaluation.
  – Other activities are up to you.
One line?
Tokenization
• For English text: Treat each string of alphanumeric characters as a token.
• Number sequentially from the start of the text collection.
• For non-English text: Depends on the language (possible student projects)
• Other considerations: Stemming, stopwords, etc.
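The English-text rule above can be sketched as follows. Several details are assumptions not stated on the slide: tokens are lowercased, "alphanumeric" is taken to mean ASCII letters and digits, numbering starts at 1, and the function name `tokenize` is hypothetical. Stemming and stopword removal are left out.

```python
import re

# A token is a maximal run of ASCII letters and digits (an assumption).
TOKEN = re.compile(r"[A-Za-z0-9]+")

def tokenize(documents):
    """Yield (position, token, doc_id) triples.

    Positions are numbered sequentially from the start of the collection,
    so they keep increasing across document boundaries.
    """
    position = 0
    for doc_id, text in enumerate(documents):
        for match in TOKEN.finditer(text):
            position += 1
            yield position, match.group().lower(), doc_id

docs = ["Shark attacks!", "Where do shark attacks occur?"]
tokens = list(tokenize(docs))
```

Numbering across the whole collection, rather than restarting in each document, gives every token occurrence a single address that an index can refer to.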
Inverted Indices
• Basic data structure
• More next day…
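Ahead of the fuller treatment, here is a minimal sketch of the basic data structure: a map from each term to the sorted list of positions where it occurs, plus a naive phrase search on top of it. The tokenization rule (lowercased ASCII alphanumeric runs, positions numbered from 1 across the collection) and the function names are assumptions for illustration, not the course's or Wumpus's implementation.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Build a positional inverted index: term -> list of token positions."""
    index = defaultdict(list)
    position = 0
    for text in documents:
        for token in re.findall(r"[A-Za-z0-9]+", text.lower()):
            position += 1
            index[token].append(position)  # positions arrive in increasing order
    return index

def phrase_positions(index, phrase):
    """Return the start positions at which the terms of phrase occur consecutively."""
    terms = phrase.lower().split()
    if not terms:
        return []
    later = [set(index.get(term, [])) for term in terms[1:]]
    return [
        start
        for start in index.get(terms[0], [])
        if all(start + offset + 1 in positions for offset, positions in enumerate(later))
    ]

index = build_index(["shark attacks occur", "the shark attacks"])
```

Because this sketch records no document boundaries, a phrase can match across the end of one document and the start of the next; a fuller index would also store where each document begins and ends.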
Plan
• Sept 17:
  – Inverted indices (from Chapter 3)
  – Index construction/Wumpus (Stefan)
• Sept 24:
  – Vector space model, Boolean retrieval, proximity
  – Basic evaluation methods
• October 1:
  – Probabilistic retrieval, language modeling
  – Start topic creation for class project
• October 8: Web search