How Search Engines Work

  • How Search Engines Work: A Technology Overview
    Avi Rappoport, Search Tools Consulting
    [email protected]
    UC Berkeley SIMS class 202
    September 16, 2004

    UCB SIMS 202, Sept. 2004. Avi Rappoport, Search Tools Consulting

  • Purpose of Search Engines
    • Helping people find what they're looking for
      • Starts with an information need
      • Convert to a query
      • Gets results
    • In the materials available
      • Web pages
      • Other formats
      • Deep Web

  • Search is Not a Panacea
    • Search can't find what's not there
    • The content is hugely important
    • Information Architecture is vital
    • Usable sites have good navigation and structure

  • Search Looks Simple

  • But It's Not
    • Index ahead of time
      • Find files or records
      • Open each one and read it
      • Store each word in a searchable index
    • Provide search forms
      • Match the query terms with words in the index
      • Sort documents by relevance
      • Display results
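
The steps on this slide can be sketched in a few lines. This is a minimal illustration, not how any particular engine is built: the sample documents and the function names `tokenize`, `build_index`, and `search` are invented here, and sorting by document ID stands in for real relevance ranking.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Index ahead of time: map each word to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Match query terms against the index; return documents containing every term."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)  # placeholder ordering, not relevance ranking

# Hypothetical sample documents, invented for illustration.
docs = {
    "a.html": "Search engines index pages ahead of time",
    "b.html": "A database query needs an exact match",
    "c.html": "Engines match query terms with words in the index",
}
index = build_index(docs)
print(search(index, "index engines"))  # ['a.html', 'c.html']
```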

  • Search Processing

  • Search is Mostly Invisible
    • Content
    • Search functionality
    • User interface
    • Like an iceberg, 2/3 below water

  • Text Search vs. Database Query
    • Text search works for structured content
    • Keyword search vs. SQL queries
    • Approximate vs. exact match
    • Multiple sources of content
    • Response time and database resources
    • Relevance ranking, very important
    • Works in the real world (e.g. eBay)

  • Search is Only as Good as the Content
    • Users blame the search engine
      • Even when the content is unavailable
    • Understand the scope of site or intranet
      • Kinds of information
      • Divided sites: products / corporate info
      • Dates
      • Languages
      • Sources and data silos: CMSs, databases...
      • Update processes

  • Making a Searchable Index
    • Store text to search it later
    • Many ways to gather text
      • Crawl (spider) via HTTP
      • Read files on file servers
      • Access databases (HTTP or API)
      • Data silos via local APIs
      • Applications, CMSs, via Web Services
    • Security and Access Control

  • Robot Indexing Diagram
    Source: James Ghaphery, VCU

  • What the Index Needs
    • Basic information for document or record
      • File name / URL / record ID
      • Title or equivalent
      • Size, date, MIME type
    • Full text of item
    • More metadata
      • Product name, picture ID
      • Category, topic, or subject
      • Other attributes, for relevance ranking and display

  • Simple Index Diagram

  • More Complex Index Processing

  • Index Issues
    • Stopwords
    • Stemming
    • Metadata
      • Explicit (tags)
      • Implicit (context)
    • Semantics
    • CMS and Database fields
    • XML tags and attributes
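
To make stopwords and stemming concrete, here is a rough sketch. The stopword list is a tiny illustrative sample and `naive_stem` is a deliberately crude suffix stripper invented for this example; production indexers generally rely on an established stemmer rather than anything this simple.

```python
import re

# Illustrative stopword list only; real lists are longer and language-specific.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in"}

def naive_stem(word):
    """Strip a few common English suffixes, keeping at least three letters."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, drop stopwords, and stem what remains."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]

print(index_terms("Indexing the engines and the searches"))
# ['index', 'engin', 'search']
```

Note that "engines" becomes "engin": stems do not have to be real words, as long as the same rule is applied to both the indexed text and the query.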

  • Search Query Processing
    • What happens after you click the search button, and before retrieval starts
    • Usually in this order
      • Handle character set, maybe language
      • Look for operators and organize the query
      • Look for field names or metadata
      • Extract words (just like the indexer)
      • Deal with letter casing
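
A possible shape for that sequence, as a sketch only. The query syntax assumed here (uppercase `AND`/`OR`/`NOT`, `field:value` filters, double-quoted phrases) is a common convention but is not specified by the slides, and `parse_query` is a hypothetical helper. It does not handle quotes inside a `field:value` token.

```python
import re
import unicodedata

OPERATORS = {"AND", "OR", "NOT"}  # assumed operator set, uppercase only

def parse_query(raw):
    """Split a raw query into operators, field filters, phrases, and words."""
    text = unicodedata.normalize("NFKC", raw)  # basic character normalization
    parsed = {"operators": [], "fields": {}, "phrases": [], "words": []}
    # Each token is either a double-quoted phrase or a run of non-space characters.
    for phrase, token in re.findall(r'"([^"]+)"|(\S+)', text):
        if phrase:
            parsed["phrases"].append(phrase.lower())
        elif token in OPERATORS:
            parsed["operators"].append(token)
        elif ":" in token and not token.startswith(":"):
            field, _, value = token.partition(":")
            parsed["fields"][field.lower()] = value.lower()
        else:
            # Extract words the same way the indexer would, folding case.
            parsed["words"].extend(re.findall(r"[a-z0-9]+", token.lower()))
    return parsed

print(parse_query('title:Index "search engine" NOT Spiders'))
```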

  • Search and Retrieval
    • Retrieval: find files with query terms
      • Not the same as relevance ranking
    • Recall: find all relevant items
    • Precision: find only relevant items
      • Increasing one usually decreases the other
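
Recall and precision can be computed directly once you know which items are relevant. The sketch below uses made-up document IDs; the narrow result set scores high on precision and low on recall, and the broad one does the reverse.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: four documents are actually relevant.
relevant = {"d1", "d2", "d3", "d4"}

narrow = precision_recall(["d1", "d2"], relevant)
broad = precision_recall(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant)
print(narrow)  # (1.0, 0.5)
print(broad)   # (0.5, 1.0)
```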

  • Retrieval = Matching
    • Single-word queries
      • Find items containing that word
    • Multi-word queries: combine lists
      • Any: every item with any query word
      • All: only items with every word
      • Phrases: find only items with all words in order
    • Boolean and complex queries
      • Use algorithm to combine lists
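
One way to picture "combine lists" is with a positional index, where each word maps to the documents and word positions in which it occurs. The helper names and sample documents below are assumptions for illustration: "any" becomes a set union, "all" an intersection, and a phrase match additionally checks that the positions are consecutive.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} so phrase order can be checked."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def match_any(index, words):
    """Every item containing at least one query word."""
    return set().union(*(index.get(w, {}).keys() for w in words))

def match_all(index, words):
    """Only items containing every query word."""
    lists = [set(index.get(w, {})) for w in words]
    return set.intersection(*lists) if lists else set()

def match_phrase(index, words):
    """Only items where the words appear next to each other, in order."""
    hits = set()
    for doc_id in match_all(index, words):
        starts = index[words[0]][doc_id]
        if any(all(p + i in index[w][doc_id] for i, w in enumerate(words))
               for p in starts):
            hits.add(doc_id)
    return hits

# Hypothetical sample documents.
docs = {1: "web search engine", 2: "search the web", 3: "engine repair manual"}
idx = build_positional_index(docs)
print(match_any(idx, ["web", "engine"]))     # {1, 2, 3}
print(match_all(idx, ["web", "search"]))     # {1, 2}
print(match_phrase(idx, ["web", "search"]))  # {1}
```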

  • Why Searches Fail
    • Empty search
    • Nothing on the site on that topic (scope)
    • Misspelling or typing mistakes
    • Vocabulary differences
    • Restrictive search defaults
    • Restrictive search choices
    • Software failure
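
Misspellings are one failure that software can partly soften. As a hedged example (the slides do not prescribe a technique), Python's standard `difflib.get_close_matches` can propose indexed words that resemble a query word that matched nothing. The vocabulary and the `suggest` helper are invented here, and the 0.8 cutoff is an arbitrary choice.

```python
import difflib

def suggest(query_word, vocabulary, max_suggestions=3):
    """Return indexed words that closely resemble the query word."""
    return difflib.get_close_matches(
        query_word.lower(), vocabulary, n=max_suggestions, cutoff=0.8
    )

# Hypothetical list of words found in the index.
vocabulary = ["search", "engine", "index", "relevance", "metadata"]
print(suggest("relevence", vocabulary))  # ['relevance']
```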

  • LII.org No-Matches Page

  • Relevance Ranking
    • Theory: sort the matching items, so the most relevant ones appear first
    • Can't really know what the user wants
      • Relevance is hard to define and situational
      • Short queries tend to be deeply ambiguous
      • What do people mean when they type "bank"?
    • First 10 results are the most important
    • The more transparent, the better

  • Relevance Processing
    • Sorting documents on various criteria
    • Start with words matching query terms
    • Citation and link analysis
      • Like old library Citation Indexes
      • Ted Nelson: not only hypertext, but the links
      • Google PageRank
        • Incoming links
        • Authority of linkers
    • Taxonomies and external metadata
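
The idea behind "incoming links" and "authority of linkers" can be illustrated with a simplified PageRank-style iteration. This is a textbook-style sketch, not Google's production algorithm: the link graph is invented, and the damping factor of 0.85 and the fixed 50 iterations are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical site: every page links to "home"; nothing links to "blog".
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "blog": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this graph "home" ends up with the highest score because it has the most incoming links, and "blog" the lowest because nothing links to it.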

  • TF-IDF Ranking Algorithm
    • Term frequency in the item
    • Inverse document frequency of term
    • Rare words are likely to be more important
    • w_ij = weight of term T_j in document D_i
    • tf_ij = frequency of term T_j in document D_i
    • N = number of documents in collection
    • n = number of documents where term T_j occurs at least once

    From Salton 1989
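
The formula itself did not survive in this transcript. Assuming the common form that matches the symbols defined on the slide, w_ij = tf_ij * log(N / n), a small sketch looks like this. The sample documents are invented and whitespace tokenization is a simplification.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return {doc_id: {term: w_ij}} using w_ij = tf_ij * log(N / n)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    N = len(tokenized)  # number of documents in the collection
    # n for each term: number of documents where it occurs at least once
    doc_freq = Counter(term for words in tokenized.values() for term in set(words))
    weights = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)  # tf_ij: frequency of each term in this document
        weights[doc_id] = {t: tf[t] * math.log(N / doc_freq[t]) for t in tf}
    return weights

# Hypothetical three-document collection.
docs = {"d1": "bank loan bank", "d2": "river bank", "d3": "loan rates"}
w = tf_idf_weights(docs)
print(w["d2"])
```

In document d2, "river" occurs in only one document and so outweighs "bank", which occurs in two: the rare word counts for more.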

  • Other Algorithms
    • Vector space
    • Probabilistic (binary independence)
    • Fuzzy set theory
    • Bayesian statistical analysis
    • Latent semantic indexing
    • Neural networks
    • Machine learning
    • All require sophisticated queries
    • See MIR, chapter 2
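
As one example from this list, the vector space model treats the query and each document as vectors of term counts and ranks by the cosine of the angle between them. The sketch below uses raw counts and whitespace tokenization for brevity; real systems usually weight the terms first (for example with TF-IDF).

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("search engine", "search engine ranking"))
print(cosine_similarity("search engine", "river bank"))  # 0.0
```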

  • Relevance Heuristics
    • Heuristics are rules of thumb
      • Not algorithms, not math
    • Search Relevance Ranking Heuristics
      • Documents containing all search words
      • Search words as a phrase
      • Matches in title tag
      • Matches in other metadata
    • Based on real-world user behavior
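
These heuristics could be combined as a simple additive score. Everything specific in this sketch is an assumption: the weights are arbitrary, the document schema (`title`, `body`, `keywords`) is hypothetical, and the phrase check is a plain substring test.

```python
# Hypothetical weights, chosen only for illustration; real engines tune these.
WEIGHTS = {"all_words": 3.0, "phrase": 2.0, "title": 1.5, "metadata": 1.0}

def heuristic_score(query, doc):
    """Score a document dict with 'title', 'body', and optional 'keywords' strings."""
    words = query.lower().split()
    body = doc["body"].lower()
    title_words = doc["title"].lower().split()
    keyword_words = doc.get("keywords", "").lower().split()
    score = 0.0
    if words and all(w in body.split() for w in words):
        score += WEIGHTS["all_words"]   # document contains all search words
    if words and " ".join(words) in body:
        score += WEIGHTS["phrase"]      # search words appear as a phrase
    if any(w in title_words for w in words):
        score += WEIGHTS["title"]       # match in the title
    if any(w in keyword_words for w in words):
        score += WEIGHTS["metadata"]    # match in other metadata
    return score

doc1 = {"title": "Search engine basics",
        "body": "how a search engine ranks pages",
        "keywords": "search ranking"}
doc2 = {"title": "Cooking",
        "body": "engine oil and a search for recipes",
        "keywords": ""}
print(heuristic_score("search engine", doc1))  # 7.5
print(heuristic_score("search engine", doc2))  # 3.0
```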

  • Search Results Interface
    • What users see after they click the Search button
    • The most visible part of search
    • Elements of the results page
      • Page layout and navigation
      • Results header
      • List of results items
      • Results footer

  • Many Experiments in Interface

  • Back to Simplicity

  • Search Suggestions (aka Best Bets)
    • Human judgment beats algorithms
    • Great for frequent, ambiguous searches
      • Use search log to identify best candidates
    • Recommend good starting pages
      • Product information, FAQs, etc.
    • Requires human resources
      • That means money and time
    • More static than algorithmic search
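
In its simplest form, a best-bets feature is an editor-maintained lookup table consulted before the algorithmic results are shown. The table contents, URLs, and the `search_with_best_bets` function below are hypothetical.

```python
# Hypothetical editor-maintained table: normalized query -> recommended page.
BEST_BETS = {
    "jobs": "/careers/",
    "careers": "/careers/",
    "parking": "/visitors/parking.html",
}

def search_with_best_bets(query, algorithmic_search):
    """Put the hand-picked page first, followed by the algorithmic results."""
    key = " ".join(query.lower().split())  # fold case and extra spaces
    results = algorithmic_search(query)
    suggestion = BEST_BETS.get(key)
    if suggestion:
        results = [suggestion] + [r for r in results if r != suggestion]
    return results

# Stand-in for the real search engine, returning fixed results.
def fake_search(query):
    return ["/news/jobs-fair.html", "/careers/"]

print(search_with_best_bets("  Jobs ", fake_search))
# ['/careers/', '/news/jobs-fair.html']
```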

  • MSU Keywords

  • Siemens Results

  • Cooks.com Results

  • Salon.com Results

  • Faceted Metadata Search & Browse
    • Leverage content structure
      • Database fields (e.g. cruise amenities)
      • Document metadata (news article bylines)
    • Provide both search and browse
      • Support information foraging
      • Integrate navigation with results
    • Not just subject taxonomies
    • Display only fruitful paths, no dead ends
    • Supported by academic research
      • Marti Hearst, UCB SIMS, flamenco.berkeley.edu
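
A rough sketch of how facet counts can be derived from the current result set. Because the counts are computed only from items that actually matched, every value offered leads to at least one result, which is one way to get "no dead ends". The cruise records and function names are invented for illustration.

```python
from collections import Counter

def facet_counts(results, facet_fields):
    """Count how many current results carry each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for item in results:
        for field in facet_fields:
            value = item.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts

def refine(results, field, value):
    """Narrow the results to items with the chosen facet value."""
    return [item for item in results if item.get(field) == value]

# Hypothetical cruise records, echoing the "cruise amenities" example.
results = [
    {"name": "Alaska 7-day", "region": "Alaska", "amenity": "spa"},
    {"name": "Alaska 10-day", "region": "Alaska", "amenity": "pool"},
    {"name": "Caribbean 5-day", "region": "Caribbean", "amenity": "pool"},
]
counts = facet_counts(results, ["region", "amenity"])
print(counts["region"])   # Alaska: 2, Caribbean: 1
print([item["name"] for item in refine(results, "amenity", "pool")])
```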

  • Faceted Search: Information

  • Faceted Search: Online Catalog

  • Search Metrics and Analytics
    • Metrics
      • Number of searches
      • Number of no-matches searches
      • Traffic from search to high-value pages
      • Relate search changes to other metrics
    • Search Log Analysis
      • Top 5% searches: phrases and words
      • Top no-matches searches
      • Use as market research
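
Several of these numbers fall out of a simple pass over the search log. The log format assumed here, a list of (query, result count) pairs, is hypothetical; real logs vary by product and usually need parsing first.

```python
from collections import Counter

def normalize(query):
    """Fold case and extra spaces so 'Jobs' and 'jobs ' count as one query."""
    return " ".join(query.lower().split())

def search_log_report(log, top_n=3):
    """Summarize a log given as a list of (query, result_count) pairs."""
    queries = Counter(normalize(q) for q, _ in log)
    no_match = Counter(normalize(q) for q, count in log if count == 0)
    total = len(log)
    return {
        "total_searches": total,
        "no_match_rate": sum(no_match.values()) / total if total else 0.0,
        "top_queries": queries.most_common(top_n),
        "top_no_match": no_match.most_common(top_n),
    }

# Hypothetical log entries.
log = [("jobs", 12), ("Jobs", 12), ("parking", 3),
       ("vacancies", 0), ("vacancies", 0), ("jobs", 12)]
report = search_log_report(log)
print(report["top_queries"])   # [('jobs', 3), ('vacancies', 2), ('parking', 1)]
print(report["top_no_match"])  # [('vacancies', 2)]
```

Here the no-match list shows that visitors say "vacancies" where the site says "jobs", the kind of vocabulary difference that a synonym or a search suggestion can fix.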

  • Search Will Never Be Perfect
    • Search engines can't read minds
    • User queries are short and ambiguous
    • Some things will help
      • Design a usable interface
      • Show match words in context
      • Keep index current and complete
      • Adjust heuristic weighting
      • Maintain suggestions and synonyms
      • Consider faceted metadata search
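
"Show match words in context" can be as simple as cutting a window of text around the first matching word. This sketch is one possible approach, with an invented `snippet` helper, square brackets as the highlight marker, and a character-based window that may cut words in half.

```python
import re

def snippet(text, query, width=30):
    """Return text around the first query-word match, with the match bracketed."""
    words = [re.escape(w) for w in query.split()]
    if not words:
        return text[: 2 * width]
    pattern = r"\b(" + "|".join(words) + r")\b"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if not match:
        return text[: 2 * width]  # no match: fall back to the opening text
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    marked = (text[start:match.start()] + "[" + match.group(0) + "]"
              + text[match.end():end])
    return ("..." if start > 0 else "") + marked + ("..." if end < len(text) else "")

text = "Search engines sort documents by relevance before display."
print(snippet(text, "relevance", width=10))
```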

  • Search Engines, sorta Rocket Science
    • Questions and discussion
    • Contact [email protected]
    • This presentation: www.searchtools.com/slides/sims/202-04/
