Search Engine-Building with Lucene and Solr
DESCRIPTION
These are the slides for the session I presented at SoCal Code Camp San Diego on July 27, 2013. http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6b28337d-6eae-4003-a664-5ed719f43533
TRANSCRIPT
![Page 1: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/1.jpg)
Search Engine-Building with Lucene and Solr
Kai Chan
SoCal Code Camp, July 2013
![Page 2: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/2.jpg)
How to Search - One Approach

    for each document d {
        if (query is a substring of d's content) {
            add d to the list of results
        }
    }
    sort the results (or not)
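The approach above can be sketched in Python as follows. This is only an illustration: the function name `naive_search` is made up for this example, and the three sample documents are borrowed from the inverted index slide later in the deck.

```python
def naive_search(documents, query):
    """Return the ids of documents whose content contains the query string."""
    results = []
    for doc_id, content in enumerate(documents):
        # Substring test: every search scans every document in full.
        if query in content:
            results.append(doc_id)
    return results


documents = ["it is what it is", "what is it", "it is a banana"]
print(naive_search(documents, "what"))    # [0, 1]
print(naive_search(documents, "banana"))  # [2]
```

Because each call walks the whole list, the cost grows with the size of the dataset, which is the problem the next slide describes.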
![Page 3: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/3.jpg)
How to Search - Problems
● slow
  ○ reads the whole dataset for each search
● not scalable
  ○ if your dataset grows by 10x, your search slows down by 10x
● how to show the most relevant documents first?
  ○ list of results can be quite long
  ○ users have limited time and patience
![Page 4: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/4.jpg)
Inverted Index - Introduction
● like the "index" at the end of books
● a map of one of the following types
  ○ term → document list
  ○ term → <document, position> list
![Page 5: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/5.jpg)
documents:

    T[0] = "it is what it is"
    T[1] = "what is it"
    T[2] = "it is a banana"

inverted index (without positions):

    "a": {2}
    "banana": {2}
    "is": {0, 1, 2}
    "it": {0, 1, 2}
    "what": {0, 1}

inverted index (with positions):

    "a": {(2, 2)}
    "banana": {(2, 3)}
    "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
    "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
    "what": {(0, 2), (1, 0)}

Credit: Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
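A minimal Python sketch of how both index variants above could be built. It assumes whitespace tokenization and no normalization, which is far simpler than what Lucene's analyzers do; `build_inverted_index` is a name chosen for this example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Build term -> document ids and term -> (document id, position) maps."""
    doc_index = defaultdict(set)    # without positions
    pos_index = defaultdict(list)   # with positions
    for doc_id, text in enumerate(documents):
        # Assumption: terms are separated by whitespace only.
        for position, term in enumerate(text.split()):
            doc_index[term].add(doc_id)
            pos_index[term].append((doc_id, position))
    return doc_index, pos_index


documents = ["it is what it is", "what is it", "it is a banana"]
doc_index, pos_index = build_inverted_index(documents)
print(sorted(doc_index["what"]))  # [0, 1]
print(pos_index["is"])            # [(0, 1), (0, 4), (1, 1), (2, 1)]
```

Looking up a term is then a single dictionary access instead of a scan over every document.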
![Page 6: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/6.jpg)
Inverted Index - Speed
● term list
  ○ typically very small
  ○ grows slowly
● term lookup
  ○ O(1) to O(log(number of terms))
● for a particular term
  ○ document lists: very small
  ○ document + position lists: still small
● few terms per query
![Page 7: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/7.jpg)
Inverted Index - Relevance
● information in the index enables:
  ○ determination (scoring) of relevance of each document to the query
  ○ comparison of relevance among documents
  ○ sorting by (decreasing) relevance
    ■ i.e. the most relevant document first
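To make the idea of scoring concrete, here is a small sketch that ranks documents with a basic TF-IDF weighting. TF-IDF is used here only as a familiar stand-in; it is not presented as Lucene's actual scoring formula. The function name `rank_by_tfidf` and the whitespace tokenization are assumptions, and a real engine would read the term statistics from the index instead of recounting them per query.

```python
import math
from collections import Counter


def rank_by_tfidf(documents, query):
    """Return document ids sorted by decreasing (simplified) TF-IDF score."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scored = []
    for doc_id, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        # Rare terms (low df) contribute more than common ones.
        score = sum(
            tf[term] * math.log(1 + n_docs / df[term])
            for term in query.split()
            if term in tf
        )
        if score > 0:
            scored.append((doc_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored]


documents = ["it is what it is", "what is it", "it is a banana"]
print(rank_by_tfidf(documents, "banana what"))  # [2, 0, 1]
```

The document containing the rare term "banana" ranks first; the two documents that only contain the more common "what" tie and keep their original order.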
![Page 8: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/8.jpg)
Lucene vs. Solr - Lucene
● full-text search library
● creates, updates and reads from the index
● takes queries and produces search results
● your application creates objects and calls methods in the Lucene API
● provides building blocks for custom features
![Page 9: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/9.jpg)
Lucene vs. Solr - Solr
● full-text search server
● uses Lucene for indexing and search
● REST-like API over HTTP
● different output formats (e.g. XML, JSON)
● provides some features not built into Lucene
![Page 10: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/10.jpg)
Lucene:
machine running Java VM
  your application
    Lucene (Lucene code, libraries)
    index

Solr:
client → (HTTP) → machine running Java VM
  servlet container (e.g. Tomcat, Jetty)
    Solr (Solr code, Lucene code, libraries)
    index
![Page 11: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/11.jpg)
Workflow
Setup
Indexing
Search
![Page 12: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/12.jpg)
Workflow
Setup
Indexing
Search
![Page 13: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/13.jpg)
Workflow - Setup
● servlet configuration
  ○ e.g. port number, max POST size
  ○ you can usually use the default settings
● Solr configuration
  ○ e.g. data directory, deduplication, language identification, highlighting
  ○ you can usually use the default settings
● schema definition
  ○ defines fields in your documents
  ○ you can use the default settings if you name your fields in a certain way
![Page 14: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/14.jpg)
How Data Are Organized
collection
  document
    field, field, field
  document
    field, field, field
  document
    field, field, field
![Page 15: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/15.jpg)
field
  content (e.g. "please read" or 30)
  name (e.g. "title" or "price")
  type
  options
![Page 16: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/16.jpg)
index
  document, document, document
    fields used across the three documents: subject, date, from, text, reply-to
    (not every document has every field)
![Page 17: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/17.jpg)
index
  document
    subject, date, from, text
  document
    title, SKU, price, description
  document
    last name, first name, phone, address
![Page 18: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/18.jpg)
Solr Field Definition
● field
  ○ name (e.g. "subject")
  ○ type (e.g. "text_general")
  ○ options (e.g. indexed="true" stored="true")
● field type
  ○ text: "string", "text_general"
  ○ numeric: "int", "long", "float", "double"
● options
  ○ indexed: content can be searched
  ○ stored: content can be returned at search-time
  ○ multivalued: multiple values per field & document
![Page 19: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/19.jpg)
Solr Dynamic Field
● define field by naming convention
● "amount_i": int, indexed, stored
● "tag_ss": string, indexed, stored, multivalued
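
| name  | type         | indexed | stored | multiValued |
|-------|--------------|---------|--------|-------------|
| *_i   | int          | true    | true   | false       |
| *_l   | long         | true    | true   | false       |
| *_f   | float        | true    | true   | false       |
| *_d   | double       | true    | true   | false       |
| *_s   | string       | true    | true   | false       |
| *_ss  | string       | true    | true   | true        |
| *_t   | text_general | true    | true   | false       |
| *_txt | text_general | true    | true   | true        |
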
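The naming convention can be illustrated with a small lookup that mirrors the rows listed above. This is a sketch of the idea only, not Solr's own pattern-matching code; `DYNAMIC_FIELDS` and `resolve_dynamic_field` are hypothetical names.

```python
# (suffix, type, indexed, stored, multi_valued), copied from the dynamic field list
DYNAMIC_FIELDS = [
    ("_i", "int", True, True, False),
    ("_l", "long", True, True, False),
    ("_f", "float", True, True, False),
    ("_d", "double", True, True, False),
    ("_s", "string", True, True, False),
    ("_ss", "string", True, True, True),
    ("_t", "text_general", True, True, False),
    ("_txt", "text_general", True, True, True),
]


def resolve_dynamic_field(name):
    """Return the field definition implied by the name's suffix, or None."""
    for suffix, field_type, indexed, stored, multi_valued in DYNAMIC_FIELDS:
        if name.endswith(suffix):
            return {
                "type": field_type,
                "indexed": indexed,
                "stored": stored,
                "multi_valued": multi_valued,
            }
    return None  # no convention applies; the field needs an explicit definition


print(resolve_dynamic_field("amount_i")["type"])       # int
print(resolve_dynamic_field("tag_ss")["multi_valued"])  # True
print(resolve_dynamic_field("title"))                   # None
```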
![Page 20: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/20.jpg)
Solr Copy Field
● copy one or more fields into another field
● can be used to define a catch-all field
  ○ source: "title", "author", "description"
  ○ destination: "text"
  ○ searching the "text" field has the effect of searching all three source fields
![Page 21: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/21.jpg)
Workflow
Setup
Indexing
Search
![Page 22: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/22.jpg)
Indexing - UpdateRequestHandler
● upload content or file to http://host:port/solr/update
● formats: XML, JSON, CSV
![Page 23: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/23.jpg)
XML:

    <add>
      <doc>
        <field name="id">apple</field>
        <field name="compName_s">Apple</field>
        <field name="address_s">1 Infinite Way, Cupertino CA</field>
      </doc>
      <doc>
        <field name="id">asus</field>
        <field name="compName_s">ASUS Computer</field>
        <field name="address_s">800 Corporate Way Fremont, CA 94539</field>
      </doc>
    </add>

CSV:

    id,compName_s,address_s
    apple,Apple,"1 Infinite Way, Cupertino CA"
    asus,Asus Computer,"800 Corporate Way Fremont, CA 94539"

JSON:

    [
      {"id":"apple","compName_s":"Apple","address_s":"1 Infinite Way, Cupertino CA"},
      {"id":"asus","compName_s":"Asus Computer","address_s":"800 Corporate Way Fremont, CA 94539"}
    ]

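As a rough illustration of uploading the JSON form, the sketch below prepares an HTTP POST for the update handler using only the Python standard library. The host and port, the `commit=true` parameter, and the helper name `build_update_request` are assumptions made for this example; the request is only built here, and sending it requires a running Solr instance.

```python
import json
import urllib.request


def build_update_request(docs, base_url="http://localhost:8983/solr"):
    """Prepare a POST that sends the documents as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/update?commit=true",  # assumption: commit immediately
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [
    {"id": "apple", "compName_s": "Apple",
     "address_s": "1 Infinite Way, Cupertino CA"},
    {"id": "asus", "compName_s": "Asus Computer",
     "address_s": "800 Corporate Way Fremont, CA 94539"},
]
request = build_update_request(docs)
print(request.get_method(), request.full_url)

# To actually send it (needs Solr listening on the assumed address):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```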
![Page 24: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/24.jpg)
Indexing - DataImportHandler
● has its own config file (data-config.xml)
● import data from various sources
  ○ RDBMS (JDBC)
  ○ e-mail (IMAP)
  ○ XML data locally (file) or remotely (HTTP)
● transformers
  ○ extract data (RegEx, XPath)
  ○ manipulate data (strip HTML tags)
![Page 25: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/25.jpg)
Workflow
Setup
Indexing
Search
![Page 26: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/26.jpg)
Searching - Basics
● send request to http://host:port/solr/select
● parameters
  ○ q - main query
  ○ fq - filter query
  ○ defType - query parser (e.g. lucene, edismax)
  ○ fl - fields to return
  ○ sort - sort criteria
  ○ wt - response writer (e.g. xml, json)
  ○ indent - set to true for pretty-printing
![Page 27: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/27.jpg)
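http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json

● search handler's URL: http://localhost:8983/solr/select
● main query: q=title:tablet
● fields to return: fl=title,price,inStock
● sort criteria: sort=price+asc (the sort parameter needs a direction, asc or desc)
● response writer: wt=json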
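A request URL like the one above can be assembled from a parameter dictionary, which avoids hand-encoding mistakes. This is a sketch: `build_select_url` is a made-up helper, and the sort value spells out a direction ("price asc") in line with the sorting syntax covered later in the deck.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(params, base_url="http://localhost:8983/solr/select"):
    """Join the search handler's URL with URL-encoded query parameters."""
    # Keep ":" and "," unescaped so the URL stays readable; spaces become "+".
    return base_url + "?" + urlencode(params, safe=":,")


url = build_select_url({
    "q": "title:tablet",           # main query
    "fl": "title,price,inStock",   # fields to return
    "sort": "price asc",           # sort criteria
    "wt": "json",                  # response writer
})
print(url)

# Decoding the query string gives back the original parameter values.
print(parse_qs(urlsplit(url).query)["sort"])  # ['price asc']
```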
![Page 28: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/28.jpg)
Searching - Query Syntax - Field
● search a specific field
  ○ field_name:value
● if field omitted, Solr uses default field:
  ○ df parameter in URL
  ○ defaultSearchField setting in schema.xml
  ○ "text"
![Page 29: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/29.jpg)
Searching - Query Syntax - Term
● a term by itself: matches documents that contain that term
  ○ e.g. tablet
![Page 30: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/30.jpg)
Searching - Query Syntax - Boolean
● boolean operators are supported
  ○ AND (or &&)
  ○ OR (or ||)
  ○ NOT (or !)
● e.g. a AND b
  ○ all of a, b must occur
● e.g. a OR b
  ○ at least one of a, b must occur
● e.g. a AND NOT b
  ○ a must occur and b must not occur
![Page 31: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/31.jpg)
Searching - Query Syntax - Boolean
● Lucene/Solr's boolean operators are not true boolean operators
● e.g. a OR b OR c does not behave like (a OR b) OR c
  ○ instead, a OR b OR c means at least one of a, b, c must occur
● parentheses are supported
![Page 32: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/32.jpg)
Searching - Query Syntax - Boolean
● "+" prefix means "must"
● "-" prefix means "must not"
● no prefix means "should" (optional)
  ○ if the query has no "+" clauses, at least one must occur (by default)
  ○ e.g. a b c
    ■ at least one of a, b, c must occur
● operators can mix
  ○ e.g. +a b c d -e
    ■ a must occur
    ■ b, c, d are optional by default; documents containing them score higher
    ■ e must not occur
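The prefix rules can be modelled as three lists of terms. The sketch below is a simplification that ignores scoring, and `matches` is a hypothetical helper. It assumes Lucene's default behaviour as I understand it: unprefixed ("should") clauses are required only when the query has no "+" clause. The `min_should_match` argument lets you request the stricter reading, where at least one unprefixed term must occur even alongside a "+" term.

```python
def matches(doc_terms, must=(), should=(), must_not=(), min_should_match=None):
    """Decide whether a document (a collection of terms) matches the clauses."""
    terms = set(doc_terms)
    if min_should_match is None:
        # Assumed default: "should" clauses are optional once a "+" clause exists.
        min_should_match = 0 if must else min(1, len(should))
    if any(term in terms for term in must_not):
        return False
    if not all(term in terms for term in must):
        return False
    return sum(term in terms for term in should) >= min_should_match


# Query: +a b c d -e
query = {"must": ["a"], "should": ["b", "c", "d"], "must_not": ["e"]}
print(matches(["a", "b"], **query))                      # True
print(matches(["a", "e"], **query))                      # False: e must not occur
print(matches(["a", "x"], **query))                      # True: b, c, d optional here
print(matches(["a", "x"], min_should_match=1, **query))  # False under the stricter reading
```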
![Page 33: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/33.jpg)
Searching - Query Syntax - Phrase
● phrases are enclosed by double-quotes
● e.g. +"the phrase"
  ○ the phrase must occur
● e.g. -"the phrase"
  ○ the phrase must not occur
![Page 34: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/34.jpg)
Searching - Query Syntax - Boost
● manually assign different weights to clauses
● gives more weight to a field
  ○ e.g. title:a^10 body:a
● gives more weight to a word
  ○ e.g. title:a title:b^10
● gives phrases more weight than words
  ○ e.g. title:(+a +b) title:"a b"^10
![Page 35: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/35.jpg)
Searching - Query Syntax - Range
● matches field values within a range
  ○ inclusive range - denoted by square brackets
  ○ exclusive range - denoted by curly brackets
● e.g. age:[10 TO 20]
  ○ matches the field "age" with the value in 10..20
● string or numeric comparison, depending on the field's type
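The difference between inclusive and exclusive bounds, and between numeric and string comparison, can be shown with a small helper. `in_range` is a hypothetical function for illustration, and Python's own ordering of integers and strings stands in for the comparison the field type would use.

```python
def in_range(value, low, high, include_low=True, include_high=True):
    """Range test with independently inclusive or exclusive bounds."""
    above = value >= low if include_low else value > low
    below = value <= high if include_high else value < high
    return above and below


# age:[10 TO 20] - inclusive on both ends
print(in_range(20, 10, 20))                                         # True
# age:{10 TO 20} - exclusive on both ends
print(in_range(20, 10, 20, include_low=False, include_high=False))  # False

# The field's type matters: as a number, 100 is outside 10..20,
# but as a string, "100" sorts between "10" and "20".
print(in_range(100, 10, 20))        # False
print(in_range("100", "10", "20"))  # True
```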
![Page 36: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/36.jpg)
Searching - Query Syntax - EDisMax
● suitable for user-generated queries
  ○ supports a subset of Lucene QP's syntax
  ○ does not complain about the syntax
  ○ searches for individual words across several fields ("disjunction")
  ○ uses max score of a word in all fields for scoring ("max")
● configurable (in solrconfig.xml)
  ○ what fields to search the words in
  ○ weighting of these fields
![Page 37: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/37.jpg)
Sorting
● default: sorting by decreasing score
● sorting by field: using the sort parameter
  ○ specify field name and order
    ■ price asc - sort by "price" field, ascending
    ■ price desc - sort by "price" field, descending
  ○ multiple fields and orders by comma
    ■ starRating desc, price asc - sort by "starRating" field, descending, and then by "price" field, ascending
  ○ cannot use multivalued fields
  ○ overrides sorting by decreasing relevance
![Page 38: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/38.jpg)
Faceted Search
● facet values: (distinct) values or (generally non-overlapping) ranges of a field
● displaying facets
  ○ show possible values
  ○ let users narrow down their searches easily
![Page 39: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/39.jpg)
[Figure: a facet and its facet values (5 of them)]
![Page 40: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/40.jpg)
Faceted Search
● set facet parameter to true - enables faceting
● other parameters
  ○ facet.field - use the field's values as facets
    ■ return <value, count> pairs
  ○ facet.query - use the given queries as facets
    ■ return <query, count> pairs
  ○ facet.sort - set the ordering of the facets
    ■ can be "count" or "index"
  ○ facet.offset and facet.limit - used for pagination of facets
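A sketch of how these parameters might be combined into a query string, plus a helper that pairs up facet counts from a response. The field name "cat" and both function names are hypothetical, and the flat `[value, count, value, count, ...]` list is an assumption about how the JSON response writer lays out field facets by default.

```python
from urllib.parse import urlencode


def build_facet_query(query, field, offset=0, limit=10, sort="count"):
    """Build the query string for a faceted search on a single field."""
    return urlencode({
        "q": query,
        "facet": "true",          # enables faceting
        "facet.field": field,     # use this field's values as facets
        "facet.sort": sort,       # "count" or "index"
        "facet.offset": offset,   # pagination of facet values
        "facet.limit": limit,
        "wt": "json",
    }, safe=":*")


def pair_facet_counts(flat):
    """Turn [value, count, value, count, ...] into [(value, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


print(build_facet_query("*:*", "cat", limit=5))
print(pair_facet_counts(["electronics", 14, "memory", 3]))
# [('electronics', 14), ('memory', 3)]
```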
![Page 41: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/41.jpg)
Resources - Books
● Lucene in Action
  ○ written by 3 committers and PMC members
  ○ somewhat outdated (2010; covers Lucene 3.0)
  ○ http://www.manning.com/hatcher3/
● Solr in Action
  ○ early access; coming out later this year
  ○ http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
  ○ common problems and useful tips
  ○ http://www.packtpub.com/apache-solr-4-cookbook/book
![Page 42: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/42.jpg)
Resources - Books
● Introduction to Information Retrieval
  ○ not specific to Lucene/Solr, but about IR concepts
  ○ free e-book
  ○ http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
  ○ indexing, compression and other topics
  ○ accompanied by MG4J - a full-text search software
  ○ http://mg4j.di.unimi.it/
![Page 43: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/43.jpg)
Resources - Web
● official websites
  ○ Lucene Core - http://lucene.apache.org/core/
  ○ Solr - http://lucene.apache.org/solr/
● mailing lists
● Wiki sites
  ○ Lucene Core - http://wiki.apache.org/lucene-java/
  ○ Solr - http://wiki.apache.org/solr/
● reference guides
  ○ API Documentation for Lucene and Solr
  ○ Apache Solr Reference Guide (LucidWorks) - http://lucene.apache.org/solr/tutorial.html
![Page 44: Search Engine-Building with Lucene and Solr](https://reader033.vdocuments.mx/reader033/viewer/2022052507/558df0c91a28ab357e8b47ea/html5/thumbnails/44.jpg)
Getting Started
● download Solr
  ○ requires Java 6 or newer to run
● Solr comes bundled and configured with Jetty
  ○ <Solr directory>/example/start.jar
● "exampledocs" directory contains sample documents
  ○ <Solr directory>/example/exampledocs/post.jar
● use the Solr admin interface
  ○ http://localhost:8983/solr/