Special Topics in Computer Science
The Art of Information Retrieval
Chapter 5: Query Operations
Alexander Gelbukh
www.Gelbukh.com

Previous chapter: Conclusions
Query languages (breadth-wise):
o words, phrases, proximity, fuzzy Boolean, natural language
Query languages (depth-wise):
o Pattern matching
If queries return sets, the results can be combined using the Boolean model
Combining with structure
o Hierarchical structure
Standardized low-level languages: protocols
o Reusable

Previous chapter: Trends and research topics
Models: to better understand the user needs
Query languages: flexibility, power, expressiveness, functionality
Visual languages
o Example: library shown on the screen. Act: take books, open catalogs, etc.
o Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

Query operations
Users have difficulties formulating queries
Program improves the query
o Interactive mode: using the user’s feedback
o Using info from the retrieved set
o Using linguistic information or information from the collection
Query expansion
o add new terms
Term re-weighting
o modify weights

1st method: User relevance feedback
User examines the top 10 (20) docs and marks the relevant ones
System uses this to construct a new query
o Moved toward relevant docs
o Away from irrelevant
Good: simplicity
Note: throughout the chapter, the correct spelling is Rocchio

User relevance feedback: Vector Space Model
Best vector to distinguish good from bad docs: avg good minus avg bad
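The "average good minus average bad" vector can be written out as a formula. The slide's own formula is not preserved in this transcript, so the following is a sketch in the notation commonly used for the vector space model. The symbols are assumptions: $C_r$ is the set of relevant docs, $N$ is the number of docs in the collection, and $\vec{d}_j$ is a document vector.

```latex
\vec{q}_{opt} =
  \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
  \;-\;
  \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j
```

In practice $C_r$ is not known in advance, which is why the user's feedback on the top docs is used to approximate it.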

User relevance feedback: Vector Space Model
Equally good results
Original query gives important info: it is kept, with weight α
Relevant docs give more info than irrelevant ones: γ < β (β weights the relevant docs, γ the irrelevant ones)
γ = 0: Positive feedback
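The feedback step for the vector space model can be sketched in Python. This is a minimal illustration of the standard Rocchio formulation, not code from the lecture. The function name `rocchio`, the dict-based vectors, and the default coefficient values are assumptions chosen for the example: `alpha` weights the original query, `beta` the average of the relevant docs, and `gamma` the average of the irrelevant docs.

```python
from collections import defaultdict


def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector (dict term -> weight).

    `query` and every doc in `relevant` / `irrelevant` is a dict
    term -> weight. The default coefficients are illustrative only.
    """
    new_query = defaultdict(float)
    # Keep the original query, scaled by alpha.
    for term, w in query.items():
        new_query[term] += alpha * w
    # Move toward the average relevant doc and away from the average
    # irrelevant doc.
    for docs, coeff in ((relevant, beta), (irrelevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, w in doc.items():
                new_query[term] += coeff * w / len(docs)
    # Terms whose weight is not positive are dropped (a common convention).
    return {t: w for t, w in new_query.items() if w > 0}
```

Calling it with `gamma=0` gives the positive-feedback variant, in which irrelevant docs are ignored. Note that terms from the relevant docs enter the new query, so this step expands the query as well as re-weighting it.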

User relevance feedback: Probabilistic Model
User feedback: used to estimate the term probabilities
Smoothing is usually applied
Bad:
o No document weights
o Previous history lost
o No new terms, only weights are changed
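The re-weighting with smoothing can be sketched as follows. This is an assumption-laden illustration, not the formula from the slide (which is not preserved here): it uses the common estimate that adds 0.5 to the counts and 1 to the totals, and the function and parameter names are hypothetical.

```python
import math


def term_weight(n_rel_with_term, n_rel, n_docs_with_term, n_docs):
    """Weight of one query term after feedback.

    n_rel_with_term:  relevant docs (marked by the user) containing the term
    n_rel:            relevant docs marked by the user
    n_docs_with_term: docs in the collection containing the term
    n_docs:           docs in the collection
    """
    # Smoothed probability that the term occurs in a relevant doc.
    p = (n_rel_with_term + 0.5) / (n_rel + 1)
    # Smoothed probability that the term occurs in a non-relevant doc.
    u = (n_docs_with_term - n_rel_with_term + 0.5) / (n_docs - n_rel + 1)
    return math.log(p / (1 - p)) + math.log((1 - u) / u)
```

The smoothing keeps both logarithms finite even when a term occurs in all or none of the marked docs. Only the weights of the terms already in the query change, which matches the drawback listed above: no new terms are added.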

... a variant for Probabilistic Model
Similarity is multiplied by TF (term frequency)
o Not exactly, but this is the idea
o Initially, IDF is also taken into account
o Details in the book
Still no query expansion, only re-weighting the original terms

Evaluation of Relevance Feedback
Simplistic:
o Evaluate precision and recall after the feedback cycle
o Not realistic, since it includes the user’s own feedback
Better:
o Only consider unseen data
o Use the rest of the collection
o The figures are not as good
o Useful to compare different methods, not to compare precision/recall before and after feedback
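Evaluating only on unseen data can be sketched like this. It is a minimal illustration under assumed conventions: docs are identified by strings, `seen` is the set of docs the user examined during feedback, and the function name is hypothetical.

```python
def residual_precision_recall(ranking, relevant, seen):
    """Precision and recall computed only on docs the user has not seen.

    ranking:  doc ids returned by the modified query, best first
    relevant: all doc ids that are relevant to the information need
    seen:     doc ids the user already examined during feedback
    """
    seen = set(seen)
    # Drop the seen docs from both the answer and the relevant set.
    residual_ranking = [d for d in ranking if d not in seen]
    residual_relevant = set(relevant) - seen
    hits = sum(1 for d in residual_ranking if d in residual_relevant)
    precision = hits / len(residual_ranking) if residual_ranking else 0.0
    recall = hits / len(residual_relevant) if residual_relevant else 0.0
    return precision, recall
```

Because the relevant docs that the user already marked are removed, the resulting figures are typically lower than those of the original query, so they are meaningful mainly for comparing one feedback method against another.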

2nd method: Automatic local analysis
Idea: add to the query synonyms, stemming variations, collocations: thesaurus-like relationships
o Based on clustering techniques
Global vs. local strategy:
o Global: the whole collection is used for this
o Local: the retrieved set. Similar to feedback, but automatic.
Local analysis: seems to give better results (better adaptation to the specific query) but is time-consuming.
o Good for local collections, not for the Web
Build clusters of words; add to each keyword its neighbors

Clustering (words)
Association clusters
o Terms that co-occur in the docs
o The clusters are the n terms that occur most frequently together with the query terms (normalized vs. non-normalized)
Metric clusters (better)
o Multiplies the number of co-occurrences by the proximity in the text
o Terms that occur in the same sentence are more related
Scalar clusters
o Terms co-occurring with the same other terms are related
o Relatedness of two words = scalar product of centroids of their association clusters
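The three kinds of clusters can be sketched in Python over the retrieved docs, each given as a list of tokens. This is an illustration under stated assumptions, not the lecture's definitions: proximity is taken as 1/distance in words, the scalar relatedness is computed as the cosine between association rows, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
import math


def association_matrix(docs):
    """c[u][v] = sum over docs of freq(u, doc) * freq(v, doc)."""
    c = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        freq = Counter(doc)
        for u, fu in freq.items():
            for v, fv in freq.items():
                c[u][v] += fu * fv
    return c


def normalized(c, u, v):
    """Normalized association; u and v must occur in the docs."""
    return c[u][v] / (c[u][u] + c[v][v] - c[u][v])


def metric_matrix(docs):
    """m[u][v] = sum of 1/distance over all pairs of occurrences in a doc."""
    m = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        for i, u in enumerate(doc):
            for j, v in enumerate(doc):
                if i != j:
                    m[u][v] += 1.0 / abs(i - j)
    return m


def scalar_similarity(c, u, v):
    """Cosine between the association rows of u and v."""
    terms = set(c[u]) | set(c[v])
    dot = sum(c[u].get(t, 0.0) * c[v].get(t, 0.0) for t in terms)
    norm_u = math.sqrt(sum(x * x for x in c[u].values()))
    norm_v = math.sqrt(sum(x * x for x in c[v].values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neighbors(matrix, term, n):
    """The n terms most strongly related to `term` (ties broken by name)."""
    row = {t: w for t, w in matrix[term].items() if t != term}
    return sorted(row, key=lambda t: (-row[t], t))[:n]
```

To expand a query, `neighbors` would be called for each query term on either matrix and the returned terms added to the query. The quadratic loops are fine for a small retrieved set but would be too slow for a whole collection.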

... variant (local clustering)
Metric-like reasoning:
Break the retrieved docs into passages (say, 300 words)
Use them as docs; use TF-IDF
Choose words related (use TF-IDF) to the whole query
Better: words occurring near each other are more related
Tune for each collection
, not 5:

3rd Method: Automatic Global Analysis
Uses all docs in the collection
Builds a thesaurus
The terms related to the whole query are added (query expansion)

Similarity thesaurus
Relatedness = occur in the same docs.
Matrix doc x term frequency
Inverse term frequency: divided by the size of the doc
Relatedness = correlation between rows of the matrix
Query: centroid, weighted (weighted sum).
Relatedness between a term and this centroid = cosine
The best terms are added to the query, with weights
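The expansion with a similarity thesaurus can be sketched as follows. This is a simplified illustration under assumptions: each term is represented by its frequency in every doc divided by the doc size (the slide's simplification), the query is the weighted sum of its term vectors, and the cosine score is returned as the weight of each added term. The exact weighting in the book differs, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vectors(docs):
    """Map each term to a vector over docs: frequency divided by doc size."""
    vectors = {}
    for j, doc in enumerate(docs):
        for term, f in Counter(doc).items():
            vectors.setdefault(term, [0.0] * len(docs))[j] = f / len(doc)
    return vectors


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def expand(query, docs, n):
    """Return the n best new terms with their scores.

    query: dict term -> weight; docs: lists of tokens (whole collection).
    """
    vectors = term_vectors(docs)
    # The query is the weighted sum (centroid) of its terms' vectors.
    centroid = [0.0] * len(docs)
    for term, w in query.items():
        for j, x in enumerate(vectors.get(term, [])):
            centroid[j] += w * x
    # Score every term outside the query against the whole query.
    scores = {t: cosine(v, centroid)
              for t, v in vectors.items() if t not in query}
    best = sorted(scores, key=lambda t: (-scores[t], t))[:n]
    return {t: scores[t] for t in best}
```

Scoring candidates against the centroid rather than against each query term separately is what makes the added terms relate to the query as a whole.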

(Global) Statistical thesaurus...
Terms added must be discriminative: low frequency
Difficult to cluster (no info)
Solution: first cluster docs; the frequency increases
Clustering docs, e.g.:
o Each doc is a cluster
o Merge two most similar clusters = their docs are similar
o Repeat until <condition>
page 136:

... statistical thesaurus
Convert the cluster hierarchy into a set of clusters
o Use a threshold similarity level to cut the hierarchy
o Don’t take too large clusters
Consider only low-frequency (in terms of ITF) terms occurring in the docs of the same class
o threshold
o These give clusters of words
Calculate weight of each class of terms. Add these terms with this weight to the query terms
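The construction of the term classes can be sketched in Python. This is an illustration under several assumptions, not the procedure from the book: docs are compared by cosine over raw term counts, cluster similarity is that of the least similar pair of docs (complete link), the hierarchy is cut with `threshold` and `max_size`, and "low frequency" is approximated by a cap on document frequency (`max_doc_freq`) rather than by ITF. The class weight calculation is not shown, and all names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine between two term-count dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_docs(docs, threshold, max_size):
    """Agglomerative clustering; returns lists of doc indices."""
    vecs = [Counter(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # each doc is a cluster

    def sim(c1, c2):
        # Complete link: a merge means all docs of both clusters are similar.
        return min(cosine(vecs[i], vecs[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        pairs = [(sim(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))
                 if len(clusters[i]) + len(clusters[j]) <= max_size]
        if not pairs:
            break
        best, i, j = max(pairs)
        if best < threshold:  # stop condition: cut the hierarchy here
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def thesaurus_classes(docs, clusters, max_doc_freq):
    """For each multi-doc cluster, keep the terms found in few docs overall."""
    doc_freq = Counter(t for d in docs for t in set(d))
    classes = []
    for cluster in clusters:
        if len(cluster) < 2:
            continue
        terms = {t for i in cluster for t in docs[i]
                 if doc_freq[t] <= max_doc_freq}
        if terms:
            classes.append(sorted(terms))
    return classes
```

Stopping at a threshold is equivalent to building the full hierarchy and cutting it at that similarity level, and `max_size` plays the role of not taking clusters that are too large.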

Research topics
Interactive interfaces
o Graphical, 2D or 3D
Refining global analysis techniques
Application of linguistic methods. Stemming. Ontologies
Local analysis for the Web (now too expensive)
Combine the three techniques (feedback, local, global)

Conclusions
Relevance feedback
o Simple, understandable
o Needs user attention
o Term re-weighting
Local analysis for query expansion
o Co-occurrences in the retrieved docs
o Usually gives better results than global analysis
o Computationally expensive
Global analysis
o Not as good results, since what is good for the whole collection is not good for a specific query
o Linguistic methods, dictionaries, ontologies, stemming, ...

Exam
Questions and exercises
You do what you consider appropriate
On Oct 23 or maybe Nov 6 (??), discuss
The class on Oct 30 is moved to Oct 23

Thank you!
Till October 23
October 23: discussion of the midterm exam, class moved from October 30
The class of Oct 30 is moved to Oct 23