An Introduction to Basics of Search and Relevancy with Apache Solr
DESCRIPTION
The open source Apache Solr search engine provides powerful, versatile search application development technology so you can take full control of your search needs. Solr’s rich interfaces, its convenient server packaging of the underlying Apache Lucene search libraries into web service interfaces, and its near-limitless customizability let you take control of your search. From e-commerce to content management and endless variations in between, Solr is the right tool at the right time to turn the ever-growing volume and variety of data and documents to the advantage of your business.
http://www.lucidimagination.com/blog/2009/12/01/webinar-an-introduction-to-basics-of-search-and-relevancy-with-apache-solr/
TRANSCRIPT
Introduction to basics of Search and Relevancy with Apache Solr
FEATURING: Mark Bennett, CTO
Lucid Imagination, Inc. 12/2/2009
Agenda
• Prerequisites: Browser Tricks
• Web “Command Line”
• The DisMax Parser
• Boosting Formula
• Explaining “Explain”
• Check Your Index!
• Q & A
• Resources / About NIE
Prerequisite: Some Browser Tricks
Browsers Matter – install them all!
Firefox:
• Default XML Rendering
• (also some versions of IE)
• Lots of Plugins
IE and Safari:
• Better “Explain” copy & paste
• maintains line breaks
• Better table copy and paste
Larger Firefox “Command Line”
Customize the Firefox URL box as a command line in 3 easy steps
1. Toolbar: Right Click
2. Customize… Add New Toolbar
3. URL bar -> CLICK and DRAG
Turn off Solr HTTP Caching
• Change in solrconfig.xml
• Disable the http304 section
• Turn it back on before you deploy!
Understanding Solr’s “Web Command Line”
The “Web Command Line”
CLI CONCEPT -> SOLR EQUIVALENT
• Command Prompt -> URL bar
• -o or --foo bar -> ? or & and =
• (spaces) -> +
• some punctuation -> %nn
• output -> XML or HTML
• Command line “adapter” -> Curl
• Script files can call URLs
• Not built into Windows – try cygwin
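The CLI-to-URL mapping above can be tried out in a few lines of Python. This is only a sketch using the standard library; the parameter values are made up to exercise spaces and punctuation, and the base URL is the typical one used in these slides.

```python
from urllib.parse import urlencode

# Hypothetical values, chosen only to show spaces and punctuation.
params = {"q": "apache solr", "fq": "cat:R&D"}

# urlencode joins pairs with "&" and "=", turns spaces into "+",
# and turns reserved punctuation into %nn escapes.
query_string = urlencode(params)
url = "http://localhost:8983/solr/select?" + query_string
print(url)
```

Here the space becomes "+", while ":" and "&" inside a value become %3A and %26.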
Solr “Command Line”
• Typical Base URL
• http://localhost:8983/solr/select?...
• Basic Input (not counting dismax)
• q = query, fq = filter query
• df = default field
• qt = query type (standard / dismax)
• Controlling Output (lots more!!!)
• debugQuery = true
• wt = “what type” (actually “writer type”)
• standard/XML, xslt (with tr=), javabin, json…
• fl = *,score (which fields)
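The parameters above can also be assembled from a script instead of the URL bar. The helper below is a hypothetical convenience function (build_select_url is not part of Solr), assuming the typical base URL from this slide.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select"  # typical base URL from the slide


def build_select_url(query, **extra):
    """Return a select URL for `query` plus any extra Solr parameters."""
    params = {"q": query}
    params.update(extra)
    # Leave "*" and "," unescaped so fl=*,score reads like the slide.
    return BASE_URL + "?" + urlencode(params, safe="*,")


url = build_select_url("solr", debugQuery="true", wt="json", fl="*,score")
print(url)
```

Pasting the printed URL into the browser is equivalent to typing the same parameters by hand.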
Example: search for “solr”
http://localhost:8983/solr/select?q=solr&debugQuery=true
With Firefox you get XML output you can expand and collapse
With MSIE* and Safari, not so much
* Some versions
Detailed Debug & Explain Output
http://localhost:8983/solr/select?q=solr&debugQuery=true
<str name="parsedquery">text:solr</str> …
<lst name="explain">
<str name="SOLR1000">
0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
1.4142135 = tf(termFreq(text:solr)=2)
3.6026897 = idf(docFreq=1, numDocs=26)
0.125 = fieldNorm(field=text, doc=13)
</str>
</lst>
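The "product of" line in this explain output can be checked by hand. A minimal sketch, taking the idf and fieldNorm values as printed above rather than recomputing them, and assuming tf is the square root of the term frequency (which matches the 1.4142135 shown for termFreq=2):

```python
import math

term_freq = 2        # termFreq(text:solr)=2
idf = 3.6026897      # copied from the explain output, not recomputed
field_norm = 0.125   # fieldNorm(field=text, doc=13)

tf = math.sqrt(term_freq)              # about 1.4142135
field_weight = tf * idf * field_norm   # the three factors listed under fieldWeight
print(round(field_weight, 7))
```

The result agrees with the 0.6368716 reported for fieldWeight(text:solr in 13).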
A look at the DisMax query parser
Solr DisMax: Defined
• What is it?
• Dis-joint text (Multiple fields)
• Max-imum match (score)
• How do you get it?
• Configured in:
• solrconfig.xml and schema.xml
• Called with:
• qt=dismax
• Adjusted with:
• mm, bf, qf, pf, qs, ps, tie
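As a rough illustration of the adjustment parameters, the sketch below builds a DisMax request in Python. All values are illustrative assumptions: the field names and boosts mirror the parsed query shown later in this deck, and real settings depend on your schema.xml and solrconfig.xml.

```python
from urllib.parse import urlencode

# Illustrative values only; not a recommended configuration.
dismax_params = {
    "q": "solr",
    "qt": "dismax",             # call the DisMax handler, as on the slide
    "qf": "name^1.2 text^0.5",  # query fields with per-field boosts
    "pf": "name^1.5 text^0.2",  # phrase fields: extra credit when terms appear together
    "mm": "1",                  # minimum number of clauses that should match
    "qs": "0",                  # slop for phrases the user typed in the query
    "ps": "0",                  # slop for the automatic pf phrase boost
    "tie": "0.01",              # tie breaker applied to the non-maximum fields
    "bf": "recip(rord(price),1,1000,1000)^0.3",  # boost function
}

url = "http://localhost:8983/solr/select?" + urlencode(dismax_params)
print(url)
```

Parameters passed on the URL this way override the defaults configured for the handler.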
Solr DisMax: Pros and Cons
General Benefits
• Multiple Fields
• Multiple Relevancy Rules
• Great for Freshness / Popularity
Issues to be Aware of
• Tie-in between schema.xml & solrconfig.xml
• Trouble with some CJK (Chinese, Japanese, Korean)
• Limited wildcard / field / range support
• Difficult to customize and debug
• Trouble with shingles
• Understand mm!
About the “dis” and the “max”
Distributed across multiple fields
• Break up query into words
• Each part becomes field clause
• Like an OR but with extra credit
Takes the Maximum of each set
• Word 1 had highest score in Title
• Word 2 very dense in the doc body
• Adds in Tie breaker if in multiple fields
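The "max plus tie breaker" rule can be written as a small function. This is a sketch of the arithmetic only, not Solr's implementation; the per-field scores are taken from the explain output shown later in this deck (text, name, features, sku), where the tie value is 0.01.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the sum of the remaining ones."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Scores for the word "solr" in the text, name, features and sku fields.
scores = [0.026233677, 0.17808011, 0.03710002, 0.44520026]
combined = dismax_score(scores, tie=0.01)
print(round(combined, 7))
```

The result agrees with the 0.4476144 reported as "max plus 0.01 times others" in that explain output.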
Coming soon: Extended DisMax
Improvements
• Flexible case Boolean ops: AND/and, OR/or
• Auto-escape punctuation & -> \&, etc.
• Improved Proximity Boosting (via word bigrams)
• Other changes in stop words, relevancy calc, URL arguments
How to get it
• Post 1.4 patch, planned for 1.5
• Details + Patch in JIRA: SOLR-1553
http://issues.apache.org/jira/browse/SOLR-1553
• TBD: change URL option qt=edismax (or qt=dismax)
Boosting Formulas
Boost Functions in DisMax
High Level Feature
• Numeric functions for scoring
• sum(), product(), sqrt(), log(), etc.
• Boost on recent dates, user popularity
Good Combination: Reverse-Ordinal & Reciprocal
• Position in index : ord(), reverse is: rord()
• Larger y for smaller x: recip()
How to get it
• URL parameter bf = “boost function”
• Configured in solrconfig.xml
• See http://wiki.apache.org/solr/FunctionQuery
“Freshness”: Boosting Recent Dates
Date | Position ord() | N-Position rord() | Linear (x,m,c): m x + c | recip(x,m,a,c): a / (m x + c)
1/1/2000 | 1 | 120 | 1120 | 0.89286
2/1/2000 | 2 | 119 | 1119 | 0.89366
3/1/2000 | 3 | 118 | 1118 | 0.89445
… | … | … | … | …
1/1/2005 | 61 | 60 | 1060 | 0.94340
… | … | … | … | …
1/1/2009 | 109 | 12 | 1012 | 0.98814
2/1/2009 | 110 | 11 | 1011 | 0.98912
3/1/2009 | 111 | 10 | 1010 | 0.99010
4/1/2009 | 112 | 9 | 1009 | 0.99108
5/1/2009 | 113 | 8 | 1008 | 0.99206
6/1/2009 | 114 | 7 | 1007 | 0.99305
7/1/2009 | 115 | 6 | 1006 | 0.99404
8/1/2009 | 116 | 5 | 1005 | 0.99502
9/1/2009 | 117 | 4 | 1004 | 0.99602
10/1/2009 | 118 | 3 | 1003 | 0.99701
11/1/2009 | 119 | 2 | 1002 | 0.99800
12/1/2009 | 120 | 1 | 1001 | 0.99900
WIKI EXAMPLE: recip( rord(creationDate), 1, 1000, 1000 )
• slope m = 1
• numerator a = 1000
• intercept c = 1000 (aka "b")
[Chart: vertical axis from 0.880 to 1.000]
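The recip() column can be reproduced directly from the formula a / (m x + c). A minimal sketch in Python, using the wiki example's values m=1, a=1000, c=1000 and a few rord() positions from the table:

```python
def recip(x, m, a, c):
    """Reciprocal function from the slide: a / (m*x + c)."""
    return a / (m * x + c)


# rord() positions for 1/1/2000, 1/1/2005 and 12/1/2009 in the table.
for rord in (120, 60, 1):
    print(rord, round(recip(rord, 1, 1000, 1000), 5))
```

Smaller rord() values (more recent dates) give results closer to 1, which is why this combination works as a freshness boost.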
Sifting through Solr’s “Explain” output
DisMax Example for “solr”
INPUT:
http://localhost:8983/solr/select?q=solr&debugQuery=true&qt=dismax
DEBUG OUTPUT: (1 OF 2)
<str name="parsedquery">
+DisjunctionMaxQuery((id:solr^10.0 | text:solr^0.5 | cat:solr^1.4 | manu:solr^1.1 | name:solr^1.2 | features:solr | sku:solr^1.5)~0.01) DisjunctionMaxQuery((manu_exact:solr^1.9 | features:solr^1.1 | text:solr^0.2 | manu:solr^1.4 | name:solr^1.5)~0.01) FunctionQuery((top(ord(popularity)))^0.5) FunctionQuery((1000.0/(1.0*float(top(rord(price)))+1000.0))^0.3)
</str>
DisMax explain output for a single word query
<lst name="explain"><str name="SOLR1000">
0.74609417 = (MATCH) sum of:
  0.4476144 = (MATCH) max plus 0.01 times others of:
    0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
      0.04119147 = queryWeight(text:solr^0.5), product of:
        0.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
      0.09885953 = queryWeight(name:solr^1.2), product of:
        1.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
    0.03710002 = (MATCH) weight(features:solr in 13), product of:
      0.08238294 = queryWeight(features:solr), product of:
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.44520026 = (MATCH) weight(sku:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(sku:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      3.6026897 = (MATCH) fieldWeight(sku:solr in 13), product of:
        1.0 = tf(termFreq(sku:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        1.0 = fieldNorm(field=sku, doc=13)
  0.22311316 = (MATCH) max plus 0.01 times others of:
    0.040810023 = (MATCH) weight(features:solr^1.1 in 13), product of:
      0.09062123 = queryWeight(features:solr^1.1), product of:
        1.1 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.01049347 = (MATCH) weight(text:solr^0.2 in 13), product of:
      0.016476588 = queryWeight(text:solr^0.2), product of:
        0.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.22260013 = (MATCH) weight(name:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(name:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
  0.06860119 = (MATCH) FunctionQuery(top(ord(popularity))), product of:
    6.0 = ord(popularity)=6
    0.5 = boost
    0.022867065 = queryNorm
  0.0067654043 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(price)))+1000.0)), product of:
    0.9861933 = 1000.0/(1.0*float(rord(price)=14)+1000.0)
    0.3 = boost
    0.022867065 = queryNorm
</str></lst>
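The top-level "sum of" can be checked by adding up its four parts. A minimal sketch in Python, with every number copied from the explain output above; the variable names are only labels chosen for this example.

```python
query_norm = 0.022867065

first_dismax = 0.4476144     # first "max plus 0.01 times others" clause
second_dismax = 0.22311316   # second "max plus 0.01 times others" clause

# FunctionQuery clauses: function value * boost * queryNorm
popularity_boost = 6.0 * 0.5 * query_norm                        # ord(popularity)=6
price_boost = (1000.0 / (1.0 * 14 + 1000.0)) * 0.3 * query_norm  # rord(price)=14

total = first_dismax + second_dismax + popularity_boost + price_boost
print(round(total, 6))
```

The total agrees with the 0.74609417 score at the top of the explain output.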
“Explain” example:
...
0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
0.04119147 = queryWeight(text:solr^0.5), product of:
0.5 = boost
3.6026897 = idf(docFreq=1, numDocs=26)
0.022867065 = queryNorm
0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
1.4142135 = tf(termFreq(text:solr)=2)
3.6026897 = idf(docFreq=1, numDocs=26)
0.125 = fieldNorm(field=text, doc=13)
0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
0.09885953 = queryWeight(name:solr^1.2), product of:
1.2 = boost
3.6026897 = idf(docFreq=1, numDocs=26)
0.022867065 = queryNorm
1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
1.0 = tf(termFreq(name:solr)=1)
3.6026897 = idf(docFreq=1, numDocs=26)
0.5 = fieldNorm(field=name, doc=13)
0.03710002 = (MATCH) weight(features:solr in 13), product of:
0.08238294 = queryWeight(features:solr), product of:
3.6026897 = idf(docFreq=1, numDocs=26)
0.022867065 = queryNorm
0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
1.0 = tf(termFreq(features:solr)=1)
3.6026897 = idf(docFreq=1, numDocs=26)
0.125 = fieldNorm(field=features, doc=13)
...
tf(termFreq(text:solr)=2)
idf(docFreq=1, numDocs=26)
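Each weight(...) entry is the product of its queryWeight and fieldWeight, and queryWeight is itself the product of boost, idf and queryNorm. A minimal sketch in Python that replays the first entry (text:solr^0.5), with all inputs copied from the explain lines above:

```python
boost = 0.5
idf = 3.6026897
query_norm = 0.022867065
field_weight = 0.6368716   # fieldWeight(text:solr in 13)

query_weight = boost * idf * query_norm   # expected: 0.04119147
weight = query_weight * field_weight      # expected: 0.026233677
print(round(query_weight, 8), round(weight, 8))
```

Reading an explain block this way, bottom-up one product at a time, shows where a boost does or does not make a difference.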
Solr’s XSLT “debugger”
http://localhost:8983/solr/select?
q=solr
&debugQuery=true
&wt=xslt
&tr=example.xsl
&fl=*,score
&qt=dismax
Another way to view Explain data
• Solr 1.4 has Solritas
• Various features, including toggle explain display
• “Some assembly required…”
http://www.lucidimagination.com/blog/2009/11/04/solritas-solr-1-4s-hidden-gem/
Checking your Index and IDF
Checking what got Indexed
Bad Index = Bad Search
• Check Upper / lower case and Punctuation
• Bad Fields / Meta Data = Bad Facets, Filters, Sorting
Use built-in Schema Browser:
• Check each field
• Common words =
• IDF “Inverse Document Frequency”
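To see why common words carry less weight, here is a sketch of the classic Lucene default formula, idf = 1 + ln(numDocs / (docFreq + 1)). The formula is an assumption about the similarity in use (a customized similarity would differ), and the docFreq of 20 is a made-up example of a common word.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


rare = idf(1, 27)     # a word found in only one document
common = idf(20, 27)  # hypothetical word found in 20 documents
print(round(rare, 7), round(common, 4))
```

A document count of 27 reproduces the 3.6026897 shown in the explain output, whereas 26 gives about 3.565; the slides do not say why the counts differ.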
Check IDF w/ the Schema Browser
Start at the Admin Screen:
http://localhost:8983/solr/admin
Schema Browser
• select a field
• change # to see more
New Idea Engineering
About NIE
NIE Resources
Search Dev Newsgroup: www.SearchDev.org
Newsletter & Whitepapers: www.ideaeng.com/current
Blogs: EnterpriseSearchBlog.com
SearchComponentsOnline.com
Finish Line / Q & A
Review & Questions
Mark Bennett [email protected]
main 408-446-3460
cell 408-829-6513
Q & A
These slides and a recorded presentation are available at
bit.ly/SolrRelevancy