how search engines work
-
How Search Engines Work: A Technology Overview
Avi Rappoport
Search Tools Consulting
[email protected]
UC Berkeley SIMS class 202
September 16, 2004
UCB SIMS 202, Sept. 2004 | Avi Rappoport, Search Tools Consulting
-
Purpose of Search Engines
  Helping people find what they're looking for
    Starts with an information need
    Convert to a query
    Gets results
  In the materials available
    Web pages
    Other formats
    Deep Web
-
Search is Not a Panacea
  Search can't find what's not there
    The content is hugely important
  Information Architecture is vital
  Usable sites have good navigation and structure
-
Search Looks Simple
-
But It's Not
  Index ahead of time
    Find files or records
    Open each one and read it
    Store each word in a searchable index
  Provide search forms
    Match the query terms with words in the index
    Sort documents by relevance
    Display results
-
Search Processing
-
Search is Mostly Invisible
  Like an iceberg, 2/3 below water
  (Diagram labels: content, search functionality, user interface)
-
Text Search vs. Database Query
  Text search works for structured content
  Keyword search vs. SQL queries
    Approximate vs. exact match
    Multiple sources of content
    Response time and database resources
    Relevance ranking, very important
  Works in the real world (e.g. eBay)
-
Search is Only as Good as the Content
  Users blame the search engine
    Even when the content is unavailable
  Understand the scope of site or intranet
    Kinds of information
    Divided sites: products / corporate info
    Dates
    Languages
    Sources and data silos: CMSs, databases...
    Update processes
-
Making a Searchable Index
  Store text to search it later
  Many ways to gather text
    Crawl (spider) via HTTP
    Read files on file servers
    Access databases (HTTP or API)
    Data silos via local APIs
    Applications, CMSs, via Web Services
  Security and Access Control
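
To make the "crawl (spider) via HTTP" item concrete, here is a minimal sketch of the step a spider repeats on every page: pull out the links so it knows what to fetch next. The class and function names and the example.com URLs are illustrative assumptions, not part of the talk, and a real crawler would also fetch pages, obey robots.txt, and handle access control.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from one HTML page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop "#fragment" parts,
                # so the same page is not queued twice.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="/products/">Products</a> <a href="about.html#team">About</a>'
links = extract_links("http://www.example.com/index.html", page)
print(links)
```

The crawler would add each returned URL to its queue, then read and index the page text.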
-
Robot Indexing Diagram
  Source: James Ghaphery, VCU
-
What the Index Needs
  Basic information for document or record
    File name / URL / record ID
    Title or equivalent
    Size, date, MIME type
  Full text of item
  More metadata
    Product name, picture ID
    Category, topic, or subject
    Other attributes, for relevance ranking and display
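
One way to picture the list above is as a single record per document. The sketch below is only an illustration: the field names, the `IndexRecord` type, and the sample values are assumptions, and real engines store this information in their own index formats.

```python
from dataclasses import dataclass, field


@dataclass
class IndexRecord:
    """Hypothetical per-document entry holding what the slide lists."""
    url: str            # file name / URL / record ID
    title: str          # title or equivalent
    size_bytes: int
    modified: str       # date, kept as an ISO string for simplicity
    mime_type: str
    full_text: str
    metadata: dict = field(default_factory=dict)  # product name, category, ...


record = IndexRecord(
    url="http://www.example.com/products/widget.html",
    title="Blue Widget",
    size_bytes=5120,
    modified="2004-09-01",
    mime_type="text/html",
    full_text="The blue widget is our best-selling widget.",
    metadata={"product_name": "Blue Widget", "category": "Widgets"},
)
print(record.title, record.metadata["category"])
```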
-
Simple Index Diagram
-
More Complex Index Processing
-
Index Issues
  Stopwords
  Stemming
  Metadata
    Explicit (tags)
    Implicit (context)
  Semantics
    CMS and Database fields
    XML tags and attributes
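
Stopwords and stemming are easiest to see in a tiny example. The sketch below assumes a very short stopword list and a deliberately crude suffix-stripping rule; both are stand-ins chosen for illustration, and production indexers use much larger lists and real stemmers.

```python
import re

STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # tiny illustrative list


def naive_stem(word):
    """Strip a couple of common English suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, split into words, drop stopwords, then stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]


terms = index_terms("Indexing the pages of a site")
print(terms)  # ['index', 'page', 'site']
```

The same function has to run on queries, otherwise a search for "pages" would never match the stored term "page".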
-
Search Query Processing
  What happens after you click the search button, and before retrieval starts.
  Usually in this order
    Handle character set, maybe language
    Look for operators and organize the query
    Look for field names or metadata
    Extract words (just like the indexer)
    Deal with letter casing
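
The order above can be sketched as a small query parser. The syntax it accepts (`+word`, `-word`, `"quoted phrase"`, `field:word`) is an assumption modeled on common search forms, not something the talk specifies, and language detection is left out.

```python
import re
import unicodedata

# Optional +/- operator, optional field name, then a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:(\w+):)?(?:"([^"]*)"|(\S+))')


def parse_query(raw):
    # 1. Character set: normalize to one Unicode form.
    text = unicodedata.normalize("NFKC", raw)
    clauses = []
    # 2-3. Operators and field names are picked up by the pattern.
    for sign, field_name, phrase, word in TOKEN.findall(text):
        body = phrase if phrase else word
        # 4-5. Extract words the same way the indexer does, and fold case.
        words = re.findall(r"\w+", body.lower())
        if not words:
            continue
        clauses.append({
            "required": sign == "+",
            "excluded": sign == "-",
            "field": field_name.lower() or None,
            "phrase": bool(phrase),
            "words": words,
        })
    return clauses


clauses = parse_query('title:Search +"Relevance Ranking" -spam')
for clause in clauses:
    print(clause)
```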
-
Search and Retrieval
  Retrieval: find files with query terms
    Not the same as relevance ranking
  Recall: find all relevant items
  Precision: find only relevant items
    Increasing one tends to decrease the other
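
Recall and precision are simple ratios, which a short worked example makes clear. The document IDs and the two result sets below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}

# A narrow search returns only relevant items, but misses half of them.
narrow = precision_recall({"d1", "d2"}, relevant)

# A broad search finds every relevant item, padded with as many irrelevant ones.
broad = precision_recall({"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}, relevant)

print(narrow)  # (1.0, 0.5)
print(broad)   # (0.5, 1.0)
```

The two calls show the trade-off on the slide: widening the match raises recall and lowers precision.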
-
Retrieval = Matching
  Single-word queries
    Find items containing that word
  Multi-word queries: combine lists
    Any: every item with any query word
    All: only items with every word
    Phrases: find only items with all words in order
  Boolean and complex queries
    Use algorithm to combine lists
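
"Combine lists" can be sketched with a small inverted index that maps each word to the documents and positions where it occurs. This is a minimal sketch under assumed names (`build_index`, `match_any`, `match_all`, `match_phrase`) and toy documents; real engines use compressed on-disk structures.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return index


def match_any(index, words):
    """Any: union of the per-word document lists."""
    result = set()
    for word in words:
        result |= set(index.get(word, {}))
    return result


def match_all(index, words):
    """All: intersection of the per-word document lists."""
    postings = [set(index.get(word, {})) for word in words]
    return set.intersection(*postings) if postings else set()


def match_phrase(index, words):
    """Phrase: documents where the words appear at consecutive positions."""
    matches = set()
    for doc_id in match_all(index, words):
        starts = index[words[0]][doc_id]
        if any(all(start + offset in index[word][doc_id]
                   for offset, word in enumerate(words))
               for start in starts):
            matches.add(doc_id)
    return matches


docs = {
    1: "search engines rank results",
    2: "engines that search the web",
    3: "ranking results by relevance",
}
index = build_index(docs)
print(match_any(index, ["search", "results"]))     # {1, 2, 3}
print(match_all(index, ["search", "engines"]))     # {1, 2}
print(match_phrase(index, ["search", "engines"]))  # {1}
```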
-
Why Searches Fail
  Empty search
  Nothing on the site on that topic (scope)
  Misspelling or typing mistakes
  Vocabulary differences
  Restrictive search defaults
  Restrictive search choices
  Software failure
-
LII.org No-Matches Page
-
Relevance Ranking
  Theory: sort the matching items, so the most relevant ones appear first
  Can't really know what the user wants
    Relevance is hard to define and situational
    Short queries tend to be deeply ambiguous
    What do people mean when they type "bank"?
  First 10 results are the most important
  The more transparent, the better
-
Relevance Processing
  Sorting documents on various criteria
  Start with words matching query terms
  Citation and link analysis
    Like old library Citation Indexes
    Ted Nelson - not only hypertext, but the links
    Google PageRank
      Incoming links
      Authority of linkers
  Taxonomies and external metadata
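
The two PageRank ideas on the slide, incoming links and the authority of the linkers, can be sketched with the textbook power-iteration form of the algorithm. This is a simplified illustration, not Google's production ranking; the damping factor of 0.85, the iteration count, and the four-page site are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            # A page with no outgoing links spreads its rank over all pages.
            targets = links.get(page) or pages
            # Each link passes on a share of the linking page's own rank,
            # so links from high-ranked pages count for more.
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


links = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home"],
    "blog": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy site "home" ranks first because every other page links to it, and "blog" ranks last because nothing links to it.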
-
TF-IDF Ranking Algorithm
  Term frequency in the item
  Inverse document frequency of term
    Rare words are likely to be more important
  w_ij = weight of term T_j in document D_i
  tf_ij = frequency of term T_j in document D_i
  N = number of documents in collection
  n = number of documents where term T_j occurs at least once
  From Salton 1989
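
The formula itself did not survive in this transcript. The sketch below assumes the commonly cited form that fits the symbols defined above, w_ij = tf_ij * log(N / n); treat that as an assumption rather than a quotation from the slide. The three toy documents are also made up.

```python
import math
import re
from collections import Counter


def tf_idf_weights(docs):
    """Return {doc_id: {term: weight}} using w_ij = tf_ij * log(N / n)."""
    counts = {doc_id: Counter(re.findall(r"[a-z0-9]+", text.lower()))
              for doc_id, text in docs.items()}
    total_docs = len(docs)        # N
    doc_freq = Counter()          # n, per term
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    return {doc_id: {term: tf * math.log(total_docs / doc_freq[term])
                     for term, tf in term_counts.items()}
            for doc_id, term_counts in counts.items()}


docs = {
    "d1": "bank loan bank",
    "d2": "river bank",
    "d3": "river boat",
}
weights = tf_idf_weights(docs)
for term, weight in sorted(weights["d1"].items()):
    print(f"{term}: {weight:.3f}")
```

In d1 the word "loan" appears once and "bank" twice, yet "loan" gets the higher weight because it occurs in only one of the three documents. That is the "rare words are likely to be more important" point.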
-
Other Algorithms
  Vector space
  Probabilistic (binary independence)
  Fuzzy set theory
  Bayesian statistical analysis
  Latent semantic indexing
  Neural networks
  Machine learning
  All require sophisticated queries
  See MIR, chapter 2
-
Relevance Heuristics
  Heuristics are rules of thumb
    Not algorithms, not math
  Search Relevance Ranking Heuristics
    Documents containing all search words
    Search words as a phrase
    Matches in title tag
    Matches in other metadata
  Based on real-world user behavior
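
The heuristics above can be sketched as a scoring function that adds points for each rule. The point values (10, 5, 3) are arbitrary assumptions chosen for the example; in practice they are tuned by watching how real users behave.

```python
import re


def words_of(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def heuristic_score(query, title, body):
    query_words = words_of(query)
    title_words = words_of(title)
    body_words = words_of(body)
    all_words = set(title_words) | set(body_words)
    score = 0
    # Rule 1: the document contains every search word.
    if query_words and all(w in all_words for w in query_words):
        score += 10
    # Rule 2: the search words appear together as a phrase in the body.
    # Padding with spaces keeps "search" from matching inside "research".
    phrase = " " + " ".join(query_words) + " "
    if len(query_words) > 1 and phrase in " " + " ".join(body_words) + " ":
        score += 5
    # Rule 3: each search word found in the title earns extra points.
    score += 3 * sum(1 for w in query_words if w in title_words)
    return score


score_a = heuristic_score("search engine", "Search Engine Basics",
                          "How a search engine builds its index")
score_b = heuristic_score("search engine", "Engine repair",
                          "Search for parts for your engine")
print(score_a, score_b)  # 21 13
```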
-
Search Results Interface
  What users see after they click the Search button
  The most visible part of search
  Elements of the results page
    Page layout and navigation
    Results header
    List of results items
    Results footer
-
Many Experiments in Interface
-
Back to Simplicity
-
Search Suggestions (aka Best Bets)
  Human judgment beats algorithms
  Great for frequent, ambiguous searches
    Use search log to identify best candidates
  Recommend good starting pages
    Product information, FAQs, etc.
  Requires human resources
    That means money and time
  More static than algorithmic search
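
Mechanically, best bets are a hand-maintained lookup table consulted before the algorithmic results are shown. The table entries, paths, and function names below are hypothetical, and the stand-in search function only exists to make the sketch runnable.

```python
# Hand-picked starting pages for frequent or ambiguous queries (made-up entries).
BEST_BETS = {
    "returns": [("Returns and refunds FAQ", "/help/returns")],
    "jobs": [("Careers", "/about/careers")],
}


def normalize(query):
    """Fold case and collapse whitespace so 'Jobs ' and 'jobs' hit the same entry."""
    return " ".join(query.lower().split())


def search_with_best_bets(query, algorithmic_search):
    """Show editor-chosen pages first, then the normal ranked results."""
    suggestions = BEST_BETS.get(normalize(query), [])
    return {"best_bets": suggestions, "results": algorithmic_search(query)}


def fake_search(query):
    """Stand-in for the real engine."""
    return ["/news/2004/hiring.html"]


page = search_with_best_bets("  Jobs ", fake_search)
print(page["best_bets"])  # [('Careers', '/about/careers')]
print(page["results"])
```

The table is the "more static" part: someone has to review the search log and keep the entries current.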
-
MSU Keywords
-
Siemens Results
-
Cooks.com Results
-
Salon.com Results
-
Faceted Metadata Search & Browse
  Leverage content structure
    Database fields (e.g. cruise amenities)
    Document metadata (news article bylines)
  Provide both search and browse
    Support information foraging
    Integrate navigation with results
    Not just subject taxonomies
    Display only fruitful paths, no dead ends
  Supported by academic research
    Marti Hearst, UCB SIMS, flamenco.berkeley.edu
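
"Display only fruitful paths, no dead ends" follows naturally if facet values are counted from the current result set rather than from the whole collection. The cruise records and function names below are invented for illustration.

```python
from collections import Counter


def facet_counts(items, facets):
    """Count how many of the current results carry each facet value."""
    counts = {facet: Counter() for facet in facets}
    for item in items:
        for facet in facets:
            value = item.get(facet)
            if value is not None:
                counts[facet][value] += 1
    return counts


def refine(items, **selected):
    """Keep only items matching every selected facet value."""
    return [item for item in items
            if all(item.get(f) == v for f, v in selected.items())]


cruises = [
    {"name": "Alaska Explorer", "region": "Alaska", "pool": "yes"},
    {"name": "Glacier Bay", "region": "Alaska", "pool": "no"},
    {"name": "Island Hopper", "region": "Caribbean", "pool": "yes"},
]

alaska = refine(cruises, region="Alaska")
counts = facet_counts(alaska, ["region", "pool"])
print(dict(counts["region"]))  # {'Alaska': 2}
print(dict(counts["pool"]))    # {'yes': 1, 'no': 1}
```

After choosing Alaska, "Caribbean" no longer appears as an option, so the user is never offered a link that would lead to zero results.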
-
Faceted Search: Information
-
Faceted Search: Online Catalog
-
Search Metrics and Analytics
  Metrics
    Number of searches
    Number of no-matches searches
    Traffic from search to high-value pages
    Relate search changes to other metrics
  Search Log Analysis
    Top 5% searches: phrases and words
    Top no-matches searches
    Use as market research
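
The first metrics on the list fall out of a simple pass over the search log. The sketch assumes the log has already been parsed into (query, result count) pairs; that format, the function name, and the sample entries are assumptions, since every engine logs differently.

```python
from collections import Counter


def analyze_log(entries):
    """entries: iterable of (query, result_count) pairs."""
    searches = Counter()
    no_matches = Counter()
    for query, result_count in entries:
        key = " ".join(query.lower().split())  # fold case and extra spaces
        searches[key] += 1
        if result_count == 0:
            no_matches[key] += 1
    total = sum(searches.values())
    return {
        "total_searches": total,
        "no_match_rate": sum(no_matches.values()) / total if total else 0.0,
        "top_searches": searches.most_common(10),
        "top_no_matches": no_matches.most_common(10),
    }


log = [
    ("Parking", 12),
    ("parking", 12),
    ("parkng", 0),
    ("library hours", 4),
    ("parkng", 0),
]
report = analyze_log(log)
print(report["total_searches"], report["no_match_rate"])
print(report["top_no_matches"])  # [('parkng', 2)]
```

Here the top no-matches query is a misspelling of the most popular search, which is exactly the kind of finding that feeds synonym lists and best bets.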
-
Search Will Never Be Perfect
  Search engines can't read minds
  User queries are short and ambiguous
  Some things will help
    Design a usable interface
    Show match words in context
    Keep index current and complete
    Adjust heuristic weighting
    Maintain suggestions and synonyms
    Consider faceted metadata search
-
Search Engines, sorta Rocket Science
  Questions and discussion
  Contact: [email protected]
  Presentation: www.searchtools.com/slides/sims/202-04/