Data Science Meeting II - Roman Kern - Building an open-source based search solution - first steps
DataScience talk by Roman Kern, Know-Center - Graz University of Technology
Date: April 12th, 2012, Graz, Austria
Building an open-source based search solution – first steps
Roman Kern
Institute of Knowledge Management
Graz University of Technology
Know-Center [email protected], [email protected]
Data Science Meetup / 2012-04-12

Overview
Motivation
Background
Solr Ecosystem
Solr Features
Conclusions

Motivation
Search
- Change in users' expectations
- Missing or sub-optimal search causes frustration
Science
- Information retrieval
- Success story
- Mostly focused on web search
Industry
- Enterprise search
- Heterogeneous data sources

Background of the Speaker
http://a1.net
http://wissen.de

Apache Lucene Umbrella Project
Components
- Search engine ⇒ Lucene
- Search server ⇒ Solr
- Web search engine ⇒ Nutch
- Lightweight crawler ⇒ Droids
- File-format parsing ⇒ Tika
- Communicate with CMS ⇒ ManifoldCF
- Distributed coordination ⇒ ZooKeeper
- Natural language processing ⇒ OpenNLP
- Related projects: Hadoop, Mahout, Carrot2, ...
Common aspects
- Apache license, implemented in Java, community

Lucene
Search Engine Library
- Java API
  - Only for expert users
- Search-Index
  - File-system
  - In-memory index
- Advanced features
  - Incremental indexing
  - Update while searching
- Base for many projects
  - Solr
  - ir-lib
  - elasticsearch
- LIA (Lucene in Action)
http://lucene.apache.org/core/

Nutch
Web search engine
- Builds upon Solr
- Web crawler
  - Link database, crawl database
- Distributed
  - Runs on Hadoop
- Mode of operation
  - Crawl a single domain
  - Crawl the web with seed sites
http://nutch.apache.org/

Droids
Crawler component
- Lightweight crawler
- Main features
  - Throttling
  - Multi-threaded
  - Well behaved (robots.txt)
http://incubator.apache.org/droids/

Tika
Text extraction
- Text & meta-data
- File-formats
  - Office
    - Microsoft Formats (Apache POI)
    - OpenDocument
  - Common text formats
    - PDF (PDFBox)
    - HTML (tagsoup)
  - Non-text
    - Images
    - Sound
http://tika.apache.org/

ManifoldCF
Content Management System Connectors
- Communicate with CMS/DMS
- Connectors
  - FileNet P8 (IBM)
  - Documentum (EMC)
  - LiveLink (OpenText)
  - Meridio (Autonomy)
  - Windows shares (Microsoft)
  - SharePoint (Microsoft)
  - More: Alfresco, JDBC, ...
- Data is then stored and indexed
  - e.g. Solr
http://incubator.apache.org/connectors/

ZooKeeper
Distributed coordination
- Orchestrate servers
- Distributed
  - Configuration
  - Name lookup
  - Synchronization
http://zookeeper.apache.org/

OpenNLP
Natural language processing
- Process plain text
- Maximum entropy classification with beam search
- Models
  - Sentence splitting
  - Token splitting
  - Part-of-speech (POS) tagging
  - Named entity recognition
  - More: chunker, parser, co-reference resolution
http://opennlp.sourceforge.net/

Hadoop
Distributed computing
- Scale out framework
- Distributed file-system
  - Data is partitioned
  - Stored on multiple nodes
- Map/Reduce paradigm
  - Map your algorithms to mappers & reducers
Related projects: HBase, Pig, Hive, ...
http://hadoop.apache.org/
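
The Map/Reduce paradigm can be illustrated with a word count. The sketch below runs in a single Python process and does not use the Hadoop API; the function names are chosen for illustration only. It merely shows how a problem is split into a mapper that emits key/value pairs and a reducer that combines all values of one key.

```python
from collections import defaultdict


def mapper(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def reducer(word, counts):
    """Combine all counts that were emitted for one word."""
    return word, sum(counts)


def map_reduce(documents):
    # Shuffle phase: group the mapper output by key.
    grouped = defaultdict(list)
    for document in documents:
        for word, count in mapper(document):
            grouped[word].append(count)
    # Reduce phase: one reducer call per key.
    return dict(reducer(word, counts) for word, counts in grouped.items())


word_counts = map_reduce(["open source search", "search server"])
```

In a real cluster the mapper calls run on the nodes that hold the data partitions, and the grouping step moves data between nodes.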

Mahout
Distributed machine learning
- Scale out framework
- Machine learning
  - Recommender systems
  - Clustering
  - Classification
- Integration
  - Standalone
  - Hadoop
  - Amazon EC2
http://mahout.apache.org/

Details

Search Server
What Solr is
- Web-Service
- Full-text indexing & search
- Support to store arbitrary content
What Solr isn't
- Solr ≠ grep
- Database
  - But somewhat similar to NoSQL databases
Solr vs. IR-Lib
- Solr: easy to use, easy to integrate, XML configuration
- IR-Lib: expert knowledge to use, Java configuration, fast

Index Structure
Inverted Index
- Dictionary of words (terms)
- Map from term to document
Document
- List of fields
- Input fields are then mapped according to the schema
Field-types
- Defined in the schema
- Type (string, boolean, date, number), internally mapped to string
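
As a rough illustration of the inverted index idea, the following sketch builds a dictionary that maps each term to the set of documents containing it. It is a toy model, not how Lucene stores its index; the sample documents and function names are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map every term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Look up a single term; return the matching document ids in order."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Solr is a search server",
    2: "Lucene is a search library",
    3: "Tika parses files",
}
index = build_inverted_index(docs)
hits = search(index, "search")
```

A lookup touches only the dictionary entry of the term, which is why search time does not depend on scanning every document.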

Index Management
API
- HTTP server
- Various formats (XML, binary, JavaScript, ...)
Document life-cycle
- There is no update
- Delete (done automatically by Solr)
- Insert
- Implications
  - A unique id is necessary
  - Use batch updates
- Commit, rollback (and optimize)
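
A batch update in Solr's XML format might be assembled as sketched below. The `add`, `doc`, `field` and `commit` elements belong to Solr's XML update syntax; the field names `id` and `title` are assumptions that depend on the schema, and sending the messages over HTTP is left out. Because there is no in-place update, re-sending a document with an existing unique id replaces the old version.

```python
import xml.etree.ElementTree as ET


def build_add_message(documents):
    """Build one <add> message that contains a whole batch of documents."""
    add = ET.Element("add")
    for document in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical documents; "id" is assumed to be the unique key of the schema.
batch = [
    {"id": "doc-1", "title": "Lucene"},
    {"id": "doc-2", "title": "Solr"},
]
add_message = build_add_message(batch)

# A separate commit message makes the added documents visible to searches.
commit_message = ET.tostring(ET.Element("commit"), encoding="unicode")
```

Sending many documents in one message and committing once at the end is usually cheaper than committing after every document.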

Input Handling
Different input formats
- XML
- CSV
- JDBC (database)
  - DIH (data import handler)
  - Support incremental updates (via timestamps)
- Solr Cell
  - Binary content
  - Apache Tika
  - Text content and metadata

Text Processing
Scope
- During indexing & query
Tokenization
- Split text into tokens
- Lower-casing
- Stemming (e.g. ponies, pony ⇒ poni; triplicate ⇒ triplic; ...)
- Synonyms (via thesaurus)
- Stop-word filtering
- Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi)
- n-grams, soundex, umlauts
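
A few of these steps can be sketched as a small analysis function. This is a simplified stand-in, not Solr's analyzer chain: the stop-word list is hypothetical, and stemming, synonyms and n-grams are left out. The same function would have to be applied both to the indexed text and to the query so that the terms match.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "of"}


def analyze(text):
    """Tokenize, split hyphenated words, lower-case, and drop stop words."""
    # Tokenization: runs of letters and digits, optionally joined by hyphens.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    terms = []
    for token in tokens:
        # Multi-word splitting: "Wi-Fi" becomes "Wi" and "Fi".
        for part in token.split("-"):
            part = part.lower()
            if part not in STOP_WORDS:
                terms.append(part)
    return terms


terms = analyze("The Wi-Fi router is new")
```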

Query Processing
Query parsers
- Lucene query parser (rich syntax)
  - AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query
  - Boosting of individual parts
  - Example: ((boltzmann OR schroedinger) NOT einstein)
- Dismax query parser
  - No query syntax
  - Searches over multiple fields (separate boost for each field)
  - Configure the number of terms that are mandatory
  - Distance between terms is used for ranking (phrase boosting)
Dismax is a good starting point, but may become expensive
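
The Dismax options listed above correspond to request parameters. The sketch below assembles such a query string; `defType`, `qf`, `mm` and `pf` are Solr parameter names, while the field names `title` and `body`, the boost value and the minimum-match setting are assumptions for the example.

```python
from urllib.parse import urlencode


def dismax_query_string(user_input):
    """Build the query part of a Dismax search request."""
    params = [
        ("defType", "dismax"),      # select the Dismax query parser
        ("q", user_input),          # raw user input, no query syntax needed
        ("qf", "title^2.0 body"),   # searched fields, with a boost on the title
        ("mm", "2"),                # at least two terms have to match
        ("pf", "body"),             # phrase boosting when terms occur close together
    ]
    return urlencode(params)


query_string = dismax_query_string("open source search")
```

The resulting string would be appended to the URL of a search handler. Every additional field in `qf` and `pf` adds work per query, which is one way Dismax can become expensive.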

Search Features
Query filter
- Additional query
- No impact on ranking
- Results are cached
Boosting query
- Only in Dismax
Query elevation
- Fix certain queries
Request handler
- Pre-define clauses
- Invariants
Function queries
- Score is computed on field values
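
The behaviour of a query filter can be modelled as follows: the filter only restricts the set of hits, scores stay untouched, and the matching document set is cached for reuse. This is a toy model with made-up names and data, not Solr's filter cache implementation.

```python
class FilterCache:
    """Cache the set of matching document ids per (field, value) filter."""

    def __init__(self, documents):
        self.documents = documents
        self.cache = {}

    def matching_ids(self, field, value):
        key = (field, value)
        if key not in self.cache:
            # Computed only once; later requests reuse the cached set.
            self.cache[key] = {
                doc_id
                for doc_id, doc in self.documents.items()
                if doc.get(field) == value
            }
        return self.cache[key]


def apply_filter(ranked_hits, allowed_ids):
    """Drop hits outside the filter; scores and order are not changed."""
    return [(doc_id, score) for doc_id, score in ranked_hits if doc_id in allowed_ids]


# Hypothetical documents and an already ranked result list of (id, score) pairs.
docs = {1: {"lang": "en"}, 2: {"lang": "de"}, 3: {"lang": "en"}}
ranked = [(3, 2.5), (2, 1.7), (1, 0.4)]
cache = FilterCache(docs)
filtered = apply_filter(ranked, cache.matching_ids("lang", "en"))
```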

Search Result
Ranking
- Relevance
- Sort on field value (only single term per document)
Available data & features
- Sequence of IDs & score
- Stored fields
- Snippets (plus highlighting)
- Facets
  - Count the search hits
  - Types: field value, dates, queries
  - Sort, prefix, ...
  - Could be used for term suggestion (a.k.a. query suggestion)
- Field collapsing (grouping)
- Spell checking (did-you-mean)
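
Field-value facets boil down to counting how often each value occurs among the search hits. A minimal sketch, assuming the hits are available as dictionaries with a hypothetical `lang` field:

```python
from collections import Counter


def facet_counts(hits, field):
    """Count the hits per value of one field, most frequent value first."""
    counts = Counter(doc[field] for doc in hits if field in doc)
    return counts.most_common()


# Hypothetical search hits; the last one has no value for the facet field.
hits = [
    {"id": 1, "lang": "en"},
    {"id": 2, "lang": "de"},
    {"id": 3, "lang": "en"},
    {"id": 4},
]
facets = facet_counts(hits, "lang")
```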

Additional Solr Features
Query by Example
- More like this
Stats
- Per field
- Min, max, sum, missing, ...
Admin-GUI
- Webapp to troubleshoot queries
- Browse schema
JMX
- Read properties & statistics
- Can be accessed remotely

Integration
Deployment
- Within a web application server
- Embedded
Monitor
- Log output
Access
- Various language bindings
- Java, Ruby, JavaScript, PHP, ...

Multi-core
Multiple indices
- Each index has its own configuration
Operations
- Reload (when configuration has been changed)
- Rename
- Swap
- Merge
- Create, Status

Scale Solr
Replication
- Master and slave nodes
- Replication
- Slaves poll master
Dispatch search request
- Load balancer
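
Dispatching search requests to the replicated slaves could be as simple as a round-robin scheme. The sketch below is one possible strategy, not a Solr feature; the host names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out the slave nodes in turn, one per search request."""

    def __init__(self, slave_urls):
        self._slaves = cycle(slave_urls)

    def next_slave(self):
        return next(self._slaves)


# Hypothetical slave nodes that replicate the index from the master.
balancer = RoundRobinBalancer(["http://slave1:8983/solr", "http://slave2:8983/solr"])
targets = [balancer.next_slave() for _ in range(3)]
```

Index updates still go to the master only; the slaves pick up the changes when they poll.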

Sharding Indexes
Single index
- Index spread over multiple machines
- Search is done in parallel
Mapping
- Application has to provide a deterministic mapping
- Document ⇒ index
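
One way to obtain a deterministic mapping from document to index is to hash the unique id of the document. The sketch below uses MD5 because Python's built-in `hash()` for strings differs between processes; the function name and the shard count are assumptions for the example.

```python
import hashlib


def shard_for(document_id, shard_count):
    """Map a document id to a shard number in the range [0, shard_count)."""
    digest = hashlib.md5(document_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


# The same id always lands on the same shard, on every machine and in every run.
shard = shard_for("doc-1", 4)
```

Changing the number of shards changes the mapping for most documents, so such a scheme normally requires re-indexing when shards are added.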

Conclusions
Ecosystem
- Vibrant community
- Corporate backing
Solr
- Easy to get started
- Hard to optimize for specific requirements

The End
Thank you!