Search Session 12 LBSC 690 Information Technology
Post on 20-Dec-2015
![Page 1: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/1.jpg)
Search
Session 12
LBSC 690
Information Technology
![Page 2: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/2.jpg)
Agenda
• The search process
• Information retrieval
• Recommender systems
• Evaluation
![Page 3: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/3.jpg)
Information “Retrieval”
• Find something that you want
– The information need may or may not be explicit
• Known item search
– Find the class home page
• Answer seeking
– Is Lexington or Louisville the capital of Kentucky?
• Directed exploration
– Who makes videoconferencing systems?
![Page 4: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/4.jpg)
Information Retrieval Paradigm
[Figure: diagram with the labels Search, Browse, Query, Select, Examine, Document, and Document Delivery]
![Page 5: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/5.jpg)
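Supporting the Search Process
[Figure: flow diagram of the search process. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Intermediate results: Query, Ranked List, Document. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection. Other labels: IR System; Nominate, Choose, Predict.]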
![Page 6: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/6.jpg)
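Supporting the Search Process
[Figure: flow diagram of the search process. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Intermediate results: Query, Ranked List, Document. System-side labels: IR System; Acquisition, Collection; Indexing, Index.]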
![Page 7: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/7.jpg)
Human-Machine Synergy
• Machines are good at:
– Doing simple things accurately and quickly
– Scaling to larger collections in sublinear time
• People are better at:
– Accurately recognizing what they are looking for
– Evaluating intangibles such as “quality”
• Both are pretty bad at:
– Mapping consistently between words and concepts
![Page 8: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/8.jpg)
Search Component Model
[Figure: component diagram. Query Processing side: Information Need, Query Formulation, Query, Representation Function, Query Representation. Document Processing side: Document, Representation Function, Document Representation. Other labels: Comparison Function, Retrieval Status Value, Human Judgment, Utility.]
![Page 9: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/9.jpg)
Ways of Finding Text
• Searching metadata
– Using controlled or uncontrolled vocabularies
• Free text
– Characterize documents by the words they contain
• Social filtering
– Exchange and interpret personal ratings
![Page 10: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/10.jpg)
“Exact Match” Retrieval
• Find all documents with some characteristic
– Indexed as “Presidents -- United States”
– Containing the words “Clinton” and “Peso”
– Read by my boss
• A set of documents is returned
– Hopefully, not too many or too few
– Usually listed in date or alphabetical order
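A minimal sketch of exact-match retrieval for the "containing the words" case. The three documents and the function names are hypothetical, and sorting by document id merely stands in for the date or alphabetical ordering mentioned above.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "Clinton discusses the peso crisis",
    2: "Clinton visits Kentucky",
    3: "The peso falls against the dollar",
}

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def exact_match(index, words):
    """Return the ids of documents that contain every query word (Boolean AND)."""
    postings = [index.get(word.lower(), set()) for word in words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

index = build_index(DOCS)
print(exact_match(index, ["Clinton", "Peso"]))  # [1]
```

The result is an unranked set: a document either satisfies the condition or it does not, which is why the size of the returned set is hard to control.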
![Page 11: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/11.jpg)
Ranked Retrieval
• Put most useful documents near top of a list
– Possibly useful documents go lower in the list
• Users can read down as far as they like
– Based on what they read, time available, ...
• Provides useful results from weak queries
– Untrained users find exact match harder to use
![Page 12: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/12.jpg)
Similarity-Based Retrieval
• Assume “most useful” = most similar to query
• Weight terms based on two criteria:
– Repeated words are good cues to meaning
– Rarely used words make searches more selective
• Compare weights with query
– Add up the weights for each query term
– Put the documents with the highest total first
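One way to sketch these steps, assuming a plain TF × IDF weight (term count in the document times the log of the inverse document frequency). The slide does not name a specific formula here, so the weighting, the toy documents, and the function name `rank` are all illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy collection.
DOCS = {
    "d1": "search engines rank documents",
    "d2": "documents about cooking and cooking classes",
    "d3": "search and more search about search engines",
}

def rank(docs, query):
    """Return (doc id, score) pairs, highest total query-term weight first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))
    query_terms = [t for t in query.lower().split() if t in df]
    scores = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)
        # Repeated terms raise the weight (tf); rare terms raise it too (idf).
        scores[doc_id] = sum(tf[t] * math.log(n_docs / df[t]) for t in query_terms)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

for doc_id, score in rank(DOCS, "cooking search"):
    print(doc_id, round(score, 3))
```

With this query, the document that repeats the rarer term "cooking" comes first, followed by the one that repeats "search", then the one that mentions "search" once.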
![Page 13: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/13.jpg)
Simple Example: Counting Words

Documents:
1: Nuclear fallout contaminated Texas.
2: Information retrieval is interesting.
3: Information retrieval is complicated.

Query: recall and fallout measures for information retrieval

| Term | Doc 1 | Doc 2 | Doc 3 | Query |
| --- | --- | --- | --- | --- |
| nuclear | 1 | | | |
| fallout | 1 | | | 1 |
| Texas | 1 | | | |
| contaminated | 1 | | | |
| interesting | | 1 | | |
| complicated | | | 1 | |
| information | | 1 | 1 | 1 |
| retrieval | | 1 | 1 | 1 |
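The counting in this example can be reproduced with a short sketch. The documents and query are the slide's own; treating "is" as a stopword is an assumption (the slide's term list omits it), and scoring by summing matching counts is one simple reading of "counting words".

```python
from collections import Counter

DOCS = {
    1: "Nuclear fallout contaminated Texas.",
    2: "Information retrieval is interesting.",
    3: "Information retrieval is complicated.",
}
QUERY = "recall and fallout measures for information retrieval"
STOPWORDS = {"is"}  # assumption: the slide's term list leaves out "is"

def count_words(text):
    """Count lowercased words, dropping punctuation and stopwords."""
    words = (w.strip(".,").lower() for w in text.split())
    return Counter(w for w in words if w and w not in STOPWORDS)

doc_counts = {doc_id: count_words(text) for doc_id, text in DOCS.items()}
vocabulary = set().union(*doc_counts.values())
# Query words that occur in no document cannot contribute to any score.
query_counts = Counter(
    {t: c for t, c in count_words(QUERY).items() if t in vocabulary}
)

def score(doc_id):
    """Sum, over query terms, of document count times query count."""
    return sum(doc_counts[doc_id][t] * c for t, c in query_counts.items())

scores = {doc_id: score(doc_id) for doc_id in DOCS}
print(scores)  # {1: 1, 2: 2, 3: 2}
```

Document 1 matches only "fallout", while documents 2 and 3 each match "information" and "retrieval", so raw counts alone cannot tell that the query is about retrieval evaluation rather than nuclear fallout.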
![Page 14: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/14.jpg)
Discussion Point: Which Terms to Emphasize?
• Major factors
– Uncommon terms are more selective
– Repeated terms provide evidence of meaning
• Adjustments
– Give more weight to terms in certain positions
• Title, first paragraph, etc.
– Give less weight to each term in longer documents
– Ignore documents that try to “spam” the index
• Invisible text, excessive use of the “meta” field, …
![Page 15: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/15.jpg)
“Okapi” Term Weights
$$w_{i,j} = \frac{TF_{i,j}}{0.5 + 1.5\,\frac{L_i}{\bar{L}} + TF_{i,j}} \cdot \log\frac{N - DF_j + 0.5}{DF_j + 0.5}$$
[Figure, TF component: Okapi TF (0.0 to 1.0) versus raw TF (0 to 25), with one curve each for L / L̄ = 0.5, 1.0, and 2.0]
[Figure, IDF component: IDF (4.4 to 6.0) versus raw DF (0 to 25), comparing Classic and Okapi]
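A hedged sketch of the two components, assuming the weight has the form TF / (0.5 + 1.5 · L / L̄ + TF) × log((N − DF + 0.5) / (DF + 0.5)), where TF is the term's count in the document, L the document length, L̄ the average document length, N the number of documents, and DF the number of documents containing the term. The function names and the natural-log base are assumptions; a different base only rescales the IDF component.

```python
import math

def okapi_tf(tf, doc_len, avg_doc_len):
    """TF component: grows with tf but saturates below 1.0."""
    return tf / (0.5 + 1.5 * doc_len / avg_doc_len + tf)

def okapi_idf(n_docs, df):
    """IDF component: larger for terms that occur in fewer documents."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def okapi_weight(tf, doc_len, avg_doc_len, n_docs, df):
    """Weight of one term in one document."""
    return okapi_tf(tf, doc_len, avg_doc_len) * okapi_idf(n_docs, df)

# TF component for an average-length document, and for one twice as long.
for raw_tf in (1, 2, 5, 25):
    print(raw_tf, okapi_tf(raw_tf, 100, 100), okapi_tf(raw_tf, 200, 100))
```

The loop illustrates the two adjustments discussed earlier: repeated occurrences add progressively less weight, and the same count is worth less in a longer document.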
![Page 16: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/16.jpg)
Index Quality
• Crawl quality
– Comprehensiveness, dead links, duplicate detection
• Document analysis
– Frames, metadata, imperfect HTML, …
• Document extension
– Anchor text, source authority, category, language, …
• Document restriction (ephemeral text suppression)
– Banner ads, keyword spam, …
![Page 17: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/17.jpg)
Indexing Anchor Text
• A type of “document expansion”
– Terms near links describe content of the target
• Works even when you can’t index content
– Image retrieval, uncrawled links, …
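As a rough illustration, the sketch below indexes link targets by the words of the anchor text that points to them, so an image or an uncrawled page becomes findable without its own content being read. The link triples and URLs are hypothetical, and using only the anchor text itself (rather than a window of nearby terms) is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical (source page, anchor text, target URL) triples.
LINKS = [
    ("http://example.org/a", "campus map image", "http://example.org/map.png"),
    ("http://example.org/b", "map of the campus", "http://example.org/map.png"),
    ("http://example.org/b", "course syllabus", "http://example.org/uncrawled"),
]

def index_anchor_text(links):
    """Map each anchor-text word to the set of targets it was used to describe."""
    index = defaultdict(set)
    for _source, anchor, target in links:
        for word in anchor.lower().split():
            index[word].add(target)
    return index

anchor_index = index_anchor_text(LINKS)
print(sorted(anchor_index["map"]))  # ['http://example.org/map.png']
```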
![Page 18: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/18.jpg)
Queries on the Web (1999)
• Low query construction effort
– 2.35 (often imprecise) terms per query
– 20% use operators
– 22% are subsequently modified
• Low browsing effort
– Only 15% view more than one page
– Most look only “above the fold”
• One study showed that 10% don’t know how to scroll!
![Page 19: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/19.jpg)
Types of User Needs
• Informational (30-40% of AltaVista queries)
– What is a quark?
• Navigational
– Find the home page of United Airlines
• Transactional
– Data: What is the weather in Paris?
– Shopping: Who sells a Vaio Z505RX?
– Proprietary: Obtain a journal article
![Page 20: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/20.jpg)
Searching Other Languages
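[Figure: flow diagram of searching in another language. Stages: Query Formulation, Query Translation, Search, Selection, Examination, Use. Intermediate results: Query, Translated Query, Ranked List, Document. Feedback loop: Query Reformulation. Supporting labels: MT, Translated “Headlines”, English Definitions.]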
![Page 21: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/21.jpg)
![Page 22: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/22.jpg)
Speech Retrieval Architecture
[Figure: architecture diagram with the components Speech Recognition, Boundary Tagging, Content Tagging, Query Formulation, Automatic Search, and Interactive Selection]
![Page 23: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/23.jpg)
Rating-Based Recommendation
• Use ratings to describe objects
– Personal recommendations, peer review, …
• Beyond topicality:
– Accuracy, coherence, depth, novelty, style, …
• Has been applied to many modalities
– Books, Usenet news, movies, music, jokes, beer, …
![Page 24: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/24.jpg)
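Using Positive Information
[Table: letter-grade ratings of attractions by person; "?" marks a rating to be predicted; column positions of unrated cells are not preserved]
Attractions (columns): Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear
Joe: D A B D ? ?
Ellen: A F D F
Mickey: A A A A A A
Goofy: D A C
John: A C A C A
Ben: F A F
Nathan: D A A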
![Page 25: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/25.jpg)
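Using Negative Information
[Table: letter-grade ratings of attractions by person; "?" marks a rating to be predicted; column positions of unrated cells are not preserved]
Attractions (columns): Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear
Joe: D A B D ? ?
Ellen: A F D F
Mickey: A A A A A A
Goofy: D A C
John: A C A C A
Ben: F A F
Nathan: D A A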
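A minimal sketch of how both kinds of evidence can feed a prediction, assuming a Pearson-correlation neighborhood method (a common collaborative filtering technique, not necessarily the one behind these slides). The raters, attractions, and grades below are hypothetical, with A to F mapped to 4 to 0. A rater who agrees with the target user contributes positive information; a rater who consistently disagrees has a negative correlation, so their low rating also pushes the prediction up.

```python
import math

# Hypothetical ratings on a 0 (F) to 4 (A) scale.
RATINGS = {
    "target":   {"ride_a": 4, "ride_b": 1, "ride_c": 3},
    "agrees":   {"ride_a": 4, "ride_b": 0, "ride_c": 3, "ride_d": 4},
    "opposite": {"ride_a": 0, "ride_b": 4, "ride_c": 1, "ride_d": 0},
}

def pearson(a, b):
    """Pearson correlation over items both users rated (0.0 if undefined)."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def predict(ratings, user, item):
    """Predict a rating as the user's mean plus correlation-weighted offsets."""
    mine = ratings[user]
    my_mean = sum(mine.values()) / len(mine)
    numerator = denominator = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = pearson(mine, theirs)
        their_mean = sum(theirs.values()) / len(theirs)
        numerator += sim * (theirs[item] - their_mean)
        denominator += abs(sim)
    if denominator == 0:
        return my_mean
    return my_mean + numerator / denominator

print(predict(RATINGS, "target", "ride_d"))
```

Both neighbors raise the predicted grade above the target user's own average: one because they liked the ride and share the user's tastes, the other because they disliked it and have the opposite tastes.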
![Page 26: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/26.jpg)
Problems with Explicit Ratings
• Cognitive load on users -- people don’t like to provide ratings
• Rating sparsity -- needs a number of raters to make recommendations
• No way to detect new items that have not been rated by any users
![Page 27: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/27.jpg)
Implicit Evidence for Ratings
[Table: observable behaviors by category; columns: Segment, Object, Class]
Examine: View, Select
Retain: Bookmark, Save, Purchase, Print, Delete, Subscribe
Reference: Quote, Cut & Paste, Cite, Link, Reply, Forward
Interpret: Annotate, Rate, Publish, Organize
![Page 28: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/28.jpg)
Click Streams
• Browsing histories are easily captured
– Send all links to a central site
– Record from and to pages and user’s cookie
– Redirect the browser to the desired page
• Reading time is correlated with interest
– Can be used to build individual profiles
– Used to target advertising by doubleclick.com
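The capture steps can be sketched without a web server: a hypothetical `ClickLog` records each click and returns the page the browser should be redirected to, and reading time is estimated from the gap between a user's consecutive clicks. The class, its methods, and the sample events are illustrative assumptions, not a description of any real tracking service.

```python
from collections import defaultdict

class ClickLog:
    """Hypothetical central site that every link passes through."""

    def __init__(self):
        self.events = []  # (timestamp in seconds, cookie, from_page, to_page)

    def redirect(self, timestamp, cookie, from_page, to_page):
        """Record the click, then return the page the browser is sent to."""
        self.events.append((timestamp, cookie, from_page, to_page))
        return to_page

    def reading_times(self):
        """Per cookie, estimate seconds spent on each page from click gaps."""
        by_user = defaultdict(list)
        for event in sorted(self.events):
            by_user[event[1]].append(event)
        profile = defaultdict(lambda: defaultdict(float))
        for cookie, events in by_user.items():
            for (t0, _, _, page), (t1, _, from_page, _) in zip(events, events[1:]):
                # Count the gap only if the next click left from the same page.
                if from_page == page:
                    profile[cookie][page] += t1 - t0
        return profile

log = ClickLog()
log.redirect(0, "cookie-1", "home", "page_a")
log.redirect(30, "cookie-1", "page_a", "page_b")
log.redirect(35, "cookie-1", "page_b", "page_c")
times = log.reading_times()
print(dict(times["cookie-1"]))  # {'page_a': 30.0, 'page_b': 5.0}
```

The time on the last page visited is unknown because no later click bounds it, which is one reason such profiles are only approximate.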
![Page 29: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/29.jpg)
Estimating Authority from Links
[Figure: link graph with pages labeled Hub and Authority]
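The slide gives only the labels, so the following is a sketch under the assumption that the hub and authority scores are computed in the style of the HITS iteration: a page's authority is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The link graph and page names are hypothetical.

```python
import math

def _normalize(scores):
    """Scale scores to unit Euclidean length (all zeros stay zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}

def hits(links, iterations=50):
    """links maps each page to the pages it links to; returns (hub, authority)."""
    out = {page: set(targets) for page, targets in links.items()}
    pages = set(out).union(*out.values())
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link here.
        auth = _normalize(
            {p: sum(hub[q] for q in out if p in out[q]) for p in pages}
        )
        # Hub: sum of authority scores of the pages linked to.
        hub = _normalize(
            {p: sum(auth[t] for t in out.get(p, ())) for p in pages}
        )
    return hub, auth

# Hypothetical link graph.
LINKS = {"hub_page": ["site_a", "site_b"], "other_page": ["site_a"]}
hub, auth = hits(LINKS)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # site_a hub_page
```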
![Page 30: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/30.jpg)
Information Retrieval Types
Source: Ayse Goker
![Page 31: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/31.jpg)
Hands On: Try Some Search Engines
• Web Pages (using spatial layout)
– http://kartoo.com/
• Images (based on image similarity)
– http://elib.cs.berkeley.edu/photos/blobworld/
• Multimedia (based on metadata)
– http://singingfish.com
• Movies (based on recommendations)
– http://www.movielens.umn.edu
• Grey literature (based on citations)
– http://citeseer.ist.psu.edu/
![Page 32: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/32.jpg)
Evaluation
• What can be measured that reflects the searcher’s ability to use a system? (Cleverdon, 1966)
– Coverage of Information
– Form of Presentation
– Effort required/Ease of Use
– Time and Space Efficiency
– Effectiveness
• Recall
• Precision
![Page 33: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/33.jpg)
Measures of Effectiveness
[Figure: the Relevant and Retrieved document sets]
$$\text{Recall} = \frac{|RelRet|}{|Rel|} \qquad \text{Precision} = \frac{|RelRet|}{|Ret|}$$
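A small worked example of the two measures, where Rel is the set of relevant documents, Ret the set of retrieved documents, and RelRet their intersection. The document ids and the function name are hypothetical, and returning 0.0 for an empty set is a convention chosen for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    rel_ret = retrieved & relevant  # relevant documents that were retrieved
    precision = len(rel_ret) / len(retrieved) if retrieved else 0.0
    recall = len(rel_ret) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: 4 documents retrieved, 3 relevant overall, 2 in common.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(precision, recall)
```

Here precision is 2 of 4 retrieved and recall is 2 of 3 relevant; retrieving more documents can only keep or raise recall, but usually lowers precision.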
![Page 34: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/34.jpg)
Precision-Recall Curves
[Figure: precision-recall curves; x-axis Recall and y-axis Precision, each from 0 to 1]
Source: Ellen Voorhees, NIST
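The points on such a curve can be computed from a single ranked list by recording recall and precision each time a relevant document is encountered. This is a sketch with a hypothetical ranking and hypothetical relevance judgments; published curves are typically interpolated and averaged over many queries, which this example does not attempt.

```python
def pr_points(ranking, relevant):
    """Return (recall, precision) at the rank of each relevant document."""
    relevant = set(relevant)
    points = []
    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

# Hypothetical ranked list and relevance judgments.
points = pr_points(["d3", "d1", "d7", "d2", "d9"], {"d3", "d7", "d9"})
for recall, precision in points:
    print(round(recall, 2), round(precision, 2))
```

Reading further down the list raises recall while precision tends to fall, which is the trade-off the curve visualizes.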
![Page 35: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/35.jpg)
Affective Evaluation
• Measure stickiness through frequency of use
– Non-comparative, long-term
• Key factors (from cognitive psychology):
– Worst experience
– Best experience
– Most recent experience
• Highly variable effectiveness is undesirable
– Bad experiences are particularly memorable
![Page 36: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/36.jpg)
Other Web Search Quality Factors
• Spam suppression
– “Adversarial information retrieval”
– Every source of evidence has been spammed
• Text, queries, links, access patterns, …
• “Family filter” accuracy
– Link analysis can be very helpful
![Page 37: Search Session 12 LBSC 690 Information Technology](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649d455503460f94a2221d/html5/thumbnails/37.jpg)
Summary
• Search is a process engaged in by people
• Human-machine synergy is the key
• Content and behavior offer useful evidence
• Evaluation must consider many factors