How Search Engines Work: A Technology Overview
Avi Rappoport, Search Tools Consulting
UC Berkeley SIMS class 202, September 16, 2004
UCB SIMS 202, Sept. 2004 | Avi Rappoport, Search Tools Consulting
Purpose of Search Engines
Helping people find what they’re looking for
• Starts with an “information need”
• Converts it to a query
• Gets results
In the materials available
• Web pages
• Other formats
• Deep Web
Search is Not a Panacea
Search can’t find what’s not there
• The content is hugely important
Information Architecture is vital
Usable sites have good navigation and structure
Search Looks Simple
But It's Not
Index ahead of time
• Find files or records
• Open each one and read it
• Store each word in a searchable index
Provide search forms
• Match the query terms with words in the index
• Sort documents by relevance
Display results
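The steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine does it: the sample documents, the function names, and the crude "count the matching words" relevance score are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Index ahead of time: map each word to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, documents, query):
    """Return IDs of documents matching every query word, best match first."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))

    def score(doc_id):
        # Crude relevance (an assumption): total occurrences of the query words.
        words = tokenize(documents[doc_id])
        return sum(words.count(t) for t in terms)

    return sorted(matches, key=lambda d: (-score(d), d))


# Hypothetical sample collection.
DOCS = {
    "a.html": "Search engines index pages ahead of time.",
    "b.html": "An index maps each word to pages. The index is searchable.",
    "c.html": "Navigation and structure matter too.",
}
INDEX = build_index(DOCS)
print(search(INDEX, DOCS, "index pages"))
```

The index is built once, before any query arrives, so each search only touches the word lists for its own terms.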
Search Processing
Search is Mostly Invisible
Like an iceberg, 2/3 below water
[Figure: iceberg diagram with layers labeled content, search functionality, user interface]
Text Search vs. Database Query
Text search works for structured content
Keyword search vs. SQL queries
Approximate vs. exact match
Multiple sources of content
Response time and database resources
Relevance ranking, very important
Works in the real world (e.g. eBay)
Search is Only as Good as the Content
Users blame the search engine
• Even when the content is unavailable
Understand the scope of site or intranet
• Kinds of information
• Divided sites: products / corporate info
• Dates
• Languages
• Sources and data silos: CMSs, databases...
• Update processes
Making a Searchable Index
Store text to search it later
Many ways to gather text
• Crawl (spider) via HTTP
• Read files on file servers
• Access databases (HTTP or API)
• Data silos via local APIs
• Applications, CMSs, via Web Services
Security and Access Control
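As a rough sketch of the first gathering method, the crawl loop below follows links breadth-first and stores the visible text of each page. To keep it self-contained, the `fetch` function is passed in and the example uses a hypothetical in-memory `SITE` dictionary instead of real HTTP requests; a production crawler would also honor robots.txt, restrict itself to chosen hosts, and respect access controls.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collect href targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()  # flush any buffered trailing text
        pages[url] = " ".join(parser.text_parts)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical in-memory "site" standing in for real HTTP requests.
SITE = {
    "http://example.test/": '<a href="/about.html">About</a> Home page',
    "http://example.test/about.html": '<a href="/">Home</a> About us',
}
PAGES = crawl("http://example.test/", SITE.get)
```

The `seen` set keeps the robot from fetching the same URL twice when pages link back to each other.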
Robot Indexing Diagram
Source: James Ghaphery, VCU
What the Index Needs
Basic information for document or record
• File name / URL / record ID
• Title or equivalent
• Size, date, MIME type
Full text of item
More metadata
• Product name, picture ID
• Category, topic, or subject
• Other attributes, for relevance ranking and display
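One way to picture such an index entry is as a small record type. The field names and sample values below are illustrative assumptions, not a schema from any specific search engine.

```python
from dataclasses import dataclass, field


@dataclass
class IndexRecord:
    """One indexed document or database record (field names are illustrative)."""

    locator: str      # file name, URL, or record ID
    title: str        # title or equivalent
    size_bytes: int
    modified: str     # date, here as an ISO 8601 string
    mime_type: str
    full_text: str
    # Extra attributes for ranking and display: category, product name, ...
    metadata: dict = field(default_factory=dict)


record = IndexRecord(
    locator="http://example.test/products/widget.html",
    title="Widget 3000",
    size_bytes=2048,
    modified="2004-09-16",
    mime_type="text/html",
    full_text="The Widget 3000 is a hypothetical product.",
    metadata={"category": "widgets", "picture_id": "img-001"},
)
```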
Simple Index Diagram
More Complex Index Processing
Index Issues
Stopwords
Stemming
Metadata
• Explicit (tags)
• Implicit (context)
Semantics
• CMS and Database fields
• XML tags and attributes
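To make the first two issues concrete, here is a toy sketch of stopword removal and stemming. The stopword list is a tiny illustrative sample, and `naive_stem` is a deliberately simple suffix stripper invented for this example; it is not the Porter stemmer or any other published algorithm.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}  # tiny sample list
SUFFIXES = ("ing", "ed", "s")  # toy rules, not a real stemming algorithm


def naive_stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stopwords, and stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]


print(index_terms("Indexing the pages of searched sites"))
```

Whatever rules the indexer applies, the query processor has to apply the same ones, or stemmed index terms will never match unstemmed query words.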
Search Query Processing
What happens after you click the search button and before retrieval starts
Usually in this order
• Handle character set, maybe language
• Look for operators and organize the query
• Look for field names or metadata
• Extract words (just like the indexer)
• Deal with letter casing
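A simplified sketch of those steps follows. The query syntax it accepts (upper-case `AND`/`OR`/`NOT`, double-quoted phrases, and `field:value` terms) is an assumption chosen for illustration; real engines each define their own syntax.

```python
import re
import unicodedata

OPERATORS = {"AND", "OR", "NOT"}  # assumed operator keywords


def parse_query(raw):
    """Turn a raw query string into a list of (kind, field, value) tuples."""
    # Character set: normalize to a single Unicode form.
    text = unicodedata.normalize("NFKC", raw)
    parsed = []
    # Split into quoted phrases and whitespace-separated tokens.
    for phrase, token in re.findall(r'"([^"]*)"|(\S+)', text):
        if phrase:
            parsed.append(("phrase", None, phrase.lower()))
        elif token in OPERATORS:
            # Operators are recognized before case folding.
            parsed.append(("operator", None, token))
        elif ":" in token and not token.startswith(":"):
            # Field name or metadata restriction, e.g. title:iceberg
            field_name, _, value = token.partition(":")
            parsed.append(("term", field_name.lower(), value.lower()))
        else:
            # Extract words the same way the indexer would, then fold case.
            for word in re.findall(r"\w+", token.lower()):
                parsed.append(("term", None, word))
    return parsed


print(parse_query('title:Iceberg AND "deep web" Search!'))
```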
Search and Retrieval
Retrieval: find files with query terms
Not the same as relevance ranking
Recall: find all relevant items
Precision: find only relevant items
Increasing one tends to decrease the other
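The two measures can be computed directly once the set of relevant items is known. The document IDs below are made up for the example; it shows a narrow search scoring high on precision and a broad one scoring high on recall.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection in which four documents are relevant.
RELEVANT = {"d1", "d2", "d3", "d4"}

# A narrow search: everything found is relevant, but half is missed.
narrow = precision_recall(["d1", "d2"], RELEVANT)

# A broad search: nothing is missed, but half the results are noise.
broad = precision_recall(
    ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], RELEVANT
)

print(narrow, broad)
```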
Retrieval = Matching
Single-word queries
• Find items containing that word
Multi-word queries: combine lists
• Any: every item with any query word
• All: only items with every word
• Phrases: find only items with all words in order
Boolean and complex queries
• Use algorithm to combine lists
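Combining lists can be sketched with set operations over a positional index. The index layout (word to document to list of positions) and the sample documents are assumptions for illustration, and the phrase match here requires the words to be adjacent as well as in order.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map word -> {doc_id: [positions]} for whitespace-tokenized text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def match_any(index, words):
    """Union: every item containing at least one query word."""
    result = set()
    for word in words:
        result |= set(index.get(word, {}))
    return result


def match_all(index, words):
    """Intersection: only items containing every query word."""
    doc_sets = [set(index.get(word, {})) for word in words]
    return set.intersection(*doc_sets) if doc_sets else set()


def match_phrase(index, words):
    """Items where the words appear next to each other, in order."""
    result = set()
    for doc_id in match_all(index, words):
        starts = index[words[0]][doc_id]
        if any(
            all(start + offset in index[word][doc_id]
                for offset, word in enumerate(words))
            for start in starts
        ):
            result.add(doc_id)
    return result


# Hypothetical documents keyed by ID.
DOCS = {
    1: "deep web search",
    2: "web search goes deep",
    3: "search the deep web",
}
INDEX = build_positional_index(DOCS)
```

Boolean queries build on the same pieces: AND is an intersection, OR is a union, and NOT is a set difference.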
Why Searches Fail
Empty search
Nothing on the site on that topic (scope)
Misspelling or typing mistakes
Vocabulary differences
Restrictive search defaults
Restrictive search choices
Software failure
LII.org No-Matches Page
Relevance Ranking
Theory: sort the matching items, so the most relevant ones appear first
Can't really know what the user wants
Relevance is hard to define and situational
Short queries tend to be deeply ambiguous
• What do people mean when they type “bank”?
First 10 results are the most important
The more transparent, the better
Relevance Processing
Sorting documents on various criteria
Start with words matching query terms
Citation and link analysis
• Like old library Citation Indexes
• Ted Nelson - not only hypertext, but the links
• Google PageRank
  • Incoming links
  • Authority of linkers
Taxonomies and external metadata
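The idea that a page gains authority from incoming links, weighted by the authority of the linkers, can be sketched as a simplified PageRank-style power iteration. This is the textbook form, not Google's production ranking; the damping value of 0.85, the iteration count, and the sample link graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}, summing to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base score regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its authority on, split among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: "orphan" has no incoming links.
LINKS = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "orphan": ["home"],
}
RANKS = pagerank(LINKS)
```

In this graph "home" ends up with the highest score because every other page links to it, while "orphan" keeps only the base score.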
TF-IDF Ranking Algorithm
Term frequency in the item
Inverse document frequency of term
• Rare words are likely to be more important
$w_{ij}$ = weight of term $T_j$ in document $D_i$
$tf_{ij}$ = frequency of term $T_j$ in document $D_i$
$N$ = number of documents in the collection
$n$ = number of documents where term $T_j$ occurs at least once
From Salton 1989
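A common form of the weighting that fits these definitions is w_ij = tf_ij * log(N / n). The sketch below computes it for a made-up three-document collection; the natural logarithm and whitespace tokenization are assumptions, and published variants differ in how they scale tf and idf.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return {doc_id: {term: weight}} using w_ij = tf_ij * log(N / n)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    total_docs = len(tokenized)  # N
    doc_freq = Counter()         # n for each term
    for words in tokenized.values():
        doc_freq.update(set(words))
    weights = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)
        weights[doc_id] = {
            term: tf[term] * math.log(total_docs / doc_freq[term]) for term in tf
        }
    return weights


# Hypothetical collection.
DOCS = {
    "d1": "bank loan bank",
    "d2": "river bank",
    "d3": "loan rates",
}
W = tf_idf_weights(DOCS)
```

In "d2" the rare word "river" gets a higher weight than "bank", which appears in two of the three documents.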
Other Algorithms
Vector space
Probabilistic (binary independence)
Fuzzy set theory
Bayesian statistical analysis
Latent semantic indexing
Neural networks
Machine learning
All require sophisticated queries
See MIR, chapter 2
Relevance Heuristics
Heuristics are rules of thumb
• Not algorithms, not math
Search Relevance Ranking Heuristics
• Documents containing all search words
• Search words as a phrase
• Matches in title tag
• Matches in other metadata
Based on real-world user behavior
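These rules of thumb can be expressed as additive bonuses. The weights, the document fields (`title`, `body`, `keywords`), and the sample documents below are hypothetical; in practice the weights are tuned against observed user behavior.

```python
# Hypothetical weights for each heuristic.
WEIGHTS = {"all_words": 10, "phrase": 20, "title": 15, "metadata": 5}


def heuristic_score(query, doc):
    """Score a doc dict with 'title', 'body', and 'keywords' string fields."""
    words = query.lower().split()
    title = doc["title"].lower()
    body = doc["body"].lower()
    keywords = doc["keywords"].lower()
    score = 0
    # Bonus when the document contains all search words.
    if words and all(w in body.split() for w in words):
        score += WEIGHTS["all_words"]
    # Bonus when the search words appear together as a phrase.
    if words and " ".join(words) in body:
        score += WEIGHTS["phrase"]
    # Bonus per search word matched in the title and in other metadata.
    score += WEIGHTS["title"] * sum(w in title.split() for w in words)
    score += WEIGHTS["metadata"] * sum(w in keywords.split() for w in words)
    return score


doc_a = {"title": "Deep Web Guide", "body": "the deep web is large", "keywords": "web search"}
doc_b = {"title": "Web Basics", "body": "the web can be deep", "keywords": "internet"}
print(heuristic_score("deep web", doc_a), heuristic_score("deep web", doc_b))
```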
Search Results Interface
What users see after they click the Search button
The most visible part of search
Elements of the results page
• Page layout and navigation
• Results header
• List of results items
• Results footer
Many Experiments in Interface
Back to Simplicity
Search Suggestions (aka Best Bets)
Human judgment beats algorithms
Great for frequent, ambiguous searches
• Use search log to identify best candidates
Recommend good starting pages
• Product information, FAQs, etc.
Requires human resources
• That means money and time
More static than algorithmic search
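At its simplest, a best-bets feature is a hand-maintained lookup table consulted before the algorithmic results are shown. The table entries, paths, and function names below are hypothetical.

```python
# Hand-curated table: normalized query -> recommended starting pages.
BEST_BETS = {
    "returns": ["/help/returns.html"],
    "jobs": ["/careers/", "/careers/faq.html"],
}


def search_with_best_bets(query, algorithmic_search):
    """Return (suggestions, results); suggestions are shown above the results."""
    key = " ".join(query.lower().split())
    suggestions = BEST_BETS.get(key, [])
    # Avoid listing a suggested page a second time in the regular results.
    results = [r for r in algorithmic_search(query) if r not in suggestions]
    return suggestions, results


def fake_engine(query):
    """Stand-in for the real search engine, returning fixed results."""
    return ["/careers/", "/news/hiring.html"]


print(search_with_best_bets("  Jobs ", fake_engine))
```

Because the table is edited by people, it changes only when someone updates it, which is why it is more static than algorithmic search.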
MSU Keywords
Siemens Results
Cooks.com Results
Salon.com Results
Faceted Metadata Search & Browse
Leverage content structure
• database fields (e.g. cruise amenities)
• document metadata (news article bylines)
Provide both search and browse
• Support information foraging
• Integrate navigation with results
• Not just subject taxonomies
• Display only fruitful paths, no dead ends
Supported by academic research
• Marti Hearst, UCB SIMS, flamenco.berkeley.edu
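The "no dead ends" idea can be sketched by counting facet values over the current result set only, so every choice offered is guaranteed to return something. The cruise records and facet names are invented for the example and do not come from the Flamenco project.

```python
from collections import Counter


def facet_counts(items, facets):
    """Count the values of each facet over the current result set."""
    counts = {facet: Counter() for facet in facets}
    for item in items:
        for facet in facets:
            value = item.get(facet)
            if value is not None:
                counts[facet][value] += 1
    return counts


def refine(items, facet, value):
    """Narrow the result set to items with the chosen facet value."""
    return [item for item in items if item.get(facet) == value]


# Hypothetical catalog records built from database fields.
CRUISES = [
    {"name": "Alaska Explorer", "region": "Alaska", "amenity": "spa"},
    {"name": "Glacier Bay", "region": "Alaska", "amenity": "pool"},
    {"name": "Island Hopper", "region": "Caribbean", "amenity": "pool"},
]

alaska = refine(CRUISES, "region", "Alaska")
counts = facet_counts(alaska, ["region", "amenity"])
print(counts)
```

After refining to Alaska, "Caribbean" no longer appears among the region choices, so the user is never offered a path with zero results.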
Faceted Search: Information
Faceted Search: Online Catalog
Search Metrics and Analytics
Metrics
• Number of searches
• Number of no-matches searches
• Traffic from search to high-value pages
• Relate search changes to other metrics
Search Log Analysis
• Top 5% searches: phrases and words
• Top no-matches searches
  • Use as market research
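A basic version of this analysis needs only a counter over the search log. The log format assumed here, a sequence of (query, result count) pairs, and the sample entries are hypothetical; real logs vary by engine.

```python
from collections import Counter


def analyze_log(entries, top_n=3):
    """entries: iterable of (query, result_count) pairs from a search log."""
    all_queries = Counter()
    no_match = Counter()
    for query, result_count in entries:
        # Fold case and whitespace so variants of a query are counted together.
        normalized = " ".join(query.lower().split())
        all_queries[normalized] += 1
        if result_count == 0:
            no_match[normalized] += 1
    total = sum(all_queries.values())
    return {
        "total_searches": total,
        "no_match_rate": sum(no_match.values()) / total if total else 0.0,
        "top_queries": all_queries.most_common(top_n),
        "top_no_match": no_match.most_common(top_n),
    }


# Hypothetical log entries.
LOG = [
    ("jobs", 12),
    ("Jobs", 12),
    ("returns policy", 3),
    ("widgit", 0),
    ("widgit", 0),
    ("jobs ", 12),
]
REPORT = analyze_log(LOG)
print(REPORT)
```

The top no-matches list points at misspellings to handle and at content people want but the site lacks.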
Search Will Never Be Perfect
Search engines can’t read minds
• User queries are short and ambiguous
Some things will help
• Design a usable interface
• Show match words in context
• Keep index current and complete
• Adjust heuristic weighting
• Maintain suggestions and synonyms
• Consider faceted metadata search
Search Engines, sorta Rocket Science
Questions and discussion
Contact me
• consult1@searchtools.com
• www.searchtools.com
This presentation:
• www.searchtools.com/slides/sims/202-04/