Understanding and visualizing Solr explain information - Rafał Kuć
DESCRIPTION
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011 Talk and presentation about how to use, understand and visualize Solr 'explain' information, essential output from Solr that lets you better tune and debug your search application. In the talk, I'll show free software, currently in development, that visualizes Solr 'explain' information: how a document's score was calculated, which components it came from, which tokens mattered the most, and so on.
TRANSCRIPT
Understanding and visualising Solr explain information
Rafał Kuć, Marek Rogoziński, 18.10.2011
My Background
• Rafał Kuć
    ◦ Working with Lucene since 2002
    ◦ Working with Solr since 2007
• Solr.pl
    ◦ Co-founder (with Marek Rogoziński)
• Area of expertise
    ◦ Lucene and Solr consultant and architect in many major e-commerce sites in Poland
    ◦ Author of „Solr 3.1 Cookbook” by Packt Publishing
    ◦ Father, husband, Starcraft II player and a gardener after hours ☺
What I Will Cover
• Understanding and visualising Solr explain information
• How to make the information given by Apache Solr explain easily readable by a Solr user (even a not very technical one)
• Context
    ◦ Complicated explain made simple
    ◦ Explain other made even simpler
• What’s next to come
A typical use case
The Challenge
• Common questions like:
    ◦ Why was this document found?
    ◦ Why wasn’t this document found?
    ◦ Why is this document ranked higher than the other one?
    ◦ Why does the results list look like this?
• Considerations
    ◦ Do we always have to answer those questions?
• So how do we help users get the answers they want?
    ◦ That’s how http://explain.solr.pl was born
Let’s look at a typical example
• You run a query
    ◦ q=ddr&defType=dismax&qf=name^1000+description^100&bf=pow(price,1.5)&debugQuery=true&indent=true
• And you see the explain information
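As a rough illustration of how such a request could be assembled, here is a minimal Python sketch. The host, port and handler path (`http://localhost:8983/solr/select`) and the helper name `build_debug_url` are assumptions made for the example; only the query parameters come from the slide. Note that the `+` in the `qf` value on the slide is a URL-encoded space.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and handler path are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_debug_url(base=SOLR_SELECT):
    """Build the dismax query from the slide with explain output enabled."""
    params = {
        "q": "ddr",
        "defType": "dismax",
        # Two query fields with boosts; the space becomes '+' when encoded.
        "qf": "name^1000 description^100",
        "bf": "pow(price,1.5)",
        # debugQuery=true is what makes Solr return the explain section.
        "debugQuery": "true",
        "indent": "true",
    }
    return base + "?" + urlencode(params)


url = build_debug_url()
print(url)
```

The explain information then shows up in the debug section of the response, one entry per returned document.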
1.6771803 = (MATCH) sum of:
  0.64883727 = (MATCH) max of:
    0.64883727 = (MATCH) weight(name:ddr^1000.0 in 6), product of:
      0.99999994 = queryWeight(name:ddr^1000.0), product of:
        1000.0 = boost
        2.446919 = idf(docFreq=3, maxDocs=17)
        4.0867718E-4 = queryNorm
      0.6488373 = (MATCH) fieldWeight(name:ddr in 6), product of:
        1.4142135 = tf(termFreq(name:ddr)=2)
        2.446919 = idf(docFreq=3, maxDocs=17)
        0.1875 = fieldNorm(field=name, doc=6)
  1.028343 = (MATCH) FunctionQuery(pow(float(price),const(1.5))), product of:
    2516.272 = pow(float(price)=185.0,const(1.5))
    1.0 = boost
    4.0867718E-4 = queryNorm
Some theory
• tf – term frequency
• df – document frequency
• idf – inverse document frequency
• norm – normalization factor
    ◦ queryNorm – query normalization factor
    ◦ fieldNorm – field normalization factor
• coord – coordination factor, a score factor based on how many of the query terms a document matches
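To make tf and idf concrete, here is a minimal sketch of the two formulas, assuming Lucene's DefaultSimilarity (the default in Solr 3.x): tf is the square root of the term frequency, and idf is 1 + ln(maxDocs / (docFreq + 1)). The function names are chosen for the example only.

```python
import math


def tf(term_freq):
    """Term frequency factor, assuming DefaultSimilarity: sqrt(freq)."""
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency, assuming DefaultSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Inputs taken from the explain output for name:ddr.
print(tf(2))       # compare with 1.4142135 = tf(termFreq(name:ddr)=2)
print(idf(3, 17))  # compare with 2.446919 = idf(docFreq=3, maxDocs=17)
```

Both results agree with the values in the explain output up to float rounding, which suggests the example index used the default similarity.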
Let’s take a look at it again
1.6771803 = (MATCH) sum of:
  0.64883727 = (MATCH) max of:
    0.64883727 = (MATCH) weight(name:ddr^1000.0 in 6), product of:
      0.99999994 = queryWeight(name:ddr^1000.0), product of:
        1000.0 = boost
        2.446919 = idf(docFreq=3, maxDocs=17)
        4.0867718E-4 = queryNorm
      0.6488373 = (MATCH) fieldWeight(name:ddr in 6), product of:
        1.4142135 = tf(termFreq(name:ddr)=2)
        2.446919 = idf(docFreq=3, maxDocs=17)
        0.1875 = fieldNorm(field=name, doc=6)
  1.028343 = (MATCH) FunctionQuery(pow(float(price),const(1.5))), product of:
    2516.272 = pow(float(price)=185.0,const(1.5))
    1.0 = boost
    4.0867718E-4 = queryNorm
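Every "product of" and "sum of" in this explain can be recomputed by hand. The sketch below does that with the leaf values copied from the output above; the variable names are mine, and small differences in the last digits are expected because Lucene computes with 32-bit floats.

```python
# Leaf values copied from the explain output for document 6.
boost, idf, query_norm = 1000.0, 2.446919, 4.0867718e-4
tf, field_norm = 1.4142135, 0.1875

# queryWeight and fieldWeight are each a "product of" their children.
query_weight = boost * idf * query_norm   # explain shows 0.99999994
field_weight = tf * idf * field_norm      # explain shows 0.6488373

# weight(name:ddr^1000.0) is the product of those two.
term_score = query_weight * field_weight  # explain shows 0.64883727

# The FunctionQuery clause: function value * boost * queryNorm.
function_score = 2516.272 * 1.0 * query_norm  # explain shows 1.028343

# The document score is the "sum of" the two top-level clauses.
total = term_score + function_score       # explain shows 1.6771803
print(query_weight, field_weight, term_score, function_score, total)
```

Even with a boost of 1000 on the name field, the price function contributes more to this document's score than the text match does.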
A little more complicated example
36.50278 = (MATCH) sum of:
  1.54896 = (MATCH) sum of:
    0.46676102 = (MATCH) max of:
      0.46676102 = (MATCH) weight(name:hard^20.0 in 2), product of:
        0.5461986 = queryWeight(name:hard^20.0), product of:
          20.0 = boost
          2.734601 = idf(docFreq=2, maxDocs=17)
          0.009986806 = queryNorm
        0.8545628 = (MATCH) fieldWeight(name:hard in 2), product of:
          1.0 = tf(termFreq(name:hard)=1)
          2.734601 = idf(docFreq=2, maxDocs=17)
          0.3125 = fieldNorm(field=name, doc=2)
    0.46676102 = (MATCH) max of:
      0.46676102 = (MATCH) weight(name:drive^20.0 in 2), product of:
        0.5461986 = queryWeight(name:drive^20.0), product of:
          20.0 = boost
          2.734601 = idf(docFreq=2, maxDocs=17)
          0.009986806 = queryNorm
        0.8545628 = (MATCH) fieldWeight(name:drive in 2), product of:
          1.0 = tf(termFreq(name:drive)=1)
          2.734601 = idf(docFreq=2, maxDocs=17)
          0.3125 = fieldNorm(field=name, doc=2)
    0.61543787 = (MATCH) max of:
      0.098470055 = (MATCH) weight(manu:maxtor in 2), product of:
        0.03135923 = queryWeight(manu:maxtor), product of:
          3.1400661 = idf(docFreq=1, maxDocs=17)
          0.009986806 = queryNorm
        3.1400661 = (MATCH) fieldWeight(manu:maxtor in 2), product of:
          1.0 = tf(termFreq(manu:maxtor)=1)
          3.1400661 = idf(docFreq=1, maxDocs=17)
          1.0 = fieldNorm(field=manu, doc=2)
      0.61543787 = (MATCH) weight(name:maxtor^20.0 in 2), product of:
        0.6271846 = queryWeight(name:maxtor^20.0), product of:
          20.0 = boost
          3.1400661 = idf(docFreq=1, maxDocs=17)
          0.009986806 = queryNorm
        0.9812707 = (MATCH) fieldWeight(name:maxtor in 2), product of:
          1.0 = tf(termFreq(name:maxtor)=1)
          3.1400661 = idf(docFreq=1, maxDocs=17)
          0.3125 = fieldNorm(field=name, doc=2)
  34.95382 = (MATCH) FunctionQuery(float(price)), product of:
    350.0 = float(price)=350.0
    10.0 = boost
    0.009986806 = queryNorm
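A quick way to see which part of this explain matters most is to compare the top-level clauses of the outer sum. The sketch below uses the two values from the output above; the dictionary labels are descriptive names I chose, not Solr terminology.

```python
# Top-level clauses of the outer "sum of", copied from the explain output.
top_level = {
    "text clauses (hard, drive, maxtor)": 1.54896,
    "FunctionQuery(float(price))": 34.95382,
}

total = sum(top_level.values())  # explain shows 36.50278

# Share of the final score contributed by each clause.
shares = {name: value / total for name, value in top_level.items()}

for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

Here the price function accounts for well over 90% of the score, so the three matching terms barely influence the ranking of this document.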
And now, a real life example
1.6287426 = (MATCH) sum of:
  0.8143703 = (MATCH) sum of:
    0.40718514 = (MATCH) max plus 0.01 times others of:
      4.154771E-7 = (MATCH) weight(description_nostemm:harry^10.0 in 36647), product of:
        4.4066886E-7 = queryWeight(description_nostemm:harry^10.0), product of:
          10.0 = boost
          7.5426636 = idf(docFreq=796, maxDocs=553224)
          5.8423506E-9 = queryNorm
        0.94283295 = (MATCH) fieldWeight(description_nostemm:harry in 36647), product of:
          1.0 = tf(termFreq(description_nostemm:harry)=1)
          7.5426636 = idf(docFreq=796, maxDocs=553224)
          0.125 = fieldNorm(field=description_nostemm, doc=36647)
      0.40718514 = (MATCH) weight(category_search:harri^2000000.0 in 36647), product of:
        0.123389944 = queryWeight(category_search:harri^2000000.0), product of:
          2000000.0 = boost
          10.559957 = idf(docFreq=38, maxDocs=553224)
          5.8423506E-9 = queryNorm
        3.2999864 = (MATCH) fieldWeight(category_search:harri in 36647), product of:
          1.0 = tf(termFreq(category_search:harri)=1)
          10.559957 = idf(docFreq=38, maxDocs=553224)
          0.3125 = fieldNorm(field=category_search, doc=36647)
      5.976383E-8 = (MATCH) weight(description:harri in 36647), product of:
        4.2931266E-8 = queryWeight(description:harri), product of:
          7.348286 = idf(docFreq=967, maxDocs=553224)
          5.8423506E-9 = queryNorm
        1.3920817 = (MATCH) fieldWeight(description:harri in 36647), product of:
          1.7320508 = tf(termFreq(description:harri)=3)
          7.348286 = idf(docFreq=967, maxDocs=553224)
          0.109375 = fieldNorm(field=description, doc=36647)
    0.40718514 = (MATCH) max plus 0.01 times others of:
      5.0300997E-7 = (MATCH) weight(description_nostemm:potter^10.0 in 36647), product of:
        4.84872E-7 = queryWeight(description_nostemm:potter^10.0), product of:
          10.0 = boost
          8.299262 = idf(docFreq=373, maxDocs=553224)
          5.8423506E-9 = queryNorm
        1.0374078 = (MATCH) fieldWeight(description_nostemm:potter in 36647), product of:
          1.0 = tf(termFreq(description_nostemm:potter)=1)
          8.299262 = idf(docFreq=373, maxDocs=553224)
          0.125 = fieldNorm(field=description_nostemm, doc=36647)
      0.40718514 = (MATCH) weight(category_search:Potter^2000000.0 in 36647), product of:
        0.123389944 = queryWeight(category_search:Potter^2000000.0), product of:
          2000000.0 = boost
          10.559957 = idf(docFreq=38, maxDocs=553224)
          5.8423506E-9 = queryNorm
        3.2999864 = (MATCH) fieldWeight(category_search:Potter in 36647), product of:
          1.0 = tf(termFreq(category_search:Potter)=1)
          10.559957 = idf(docFreq=38, maxDocs=553224)
          0.3125 = fieldNorm(field=category_search, doc=36647)
      5.7398886E-8 = (MATCH) weight(description:Potter in 36647), product of:
        4.656172E-8 = queryWeight(description:Potter), product of:
          7.9696894 = idf(docFreq=519, maxDocs=553224)
          5.8423506E-9 = queryNorm
        1.2327484 = (MATCH) fieldWeight(description:Potter in 36647), product of:
          1.4142135 = tf(termFreq(description:Potter)=2)
          7.9696894 = idf(docFreq=519, maxDocs=553224)
          0.109375 = fieldNorm(field=description, doc=36647)
  1.8327936E-6 = (MATCH) max plus 0.01 times others of:
    1.8327936E-6 = (MATCH) weight(description_nostemm:"harry potter"~100^10.0 in 36647), product of:
      9.255408E-7 = queryWeight(description_nostemm:"harry potter"~100^10.0), product of:
        10.0 = boost
        15.841926 = idf(description_nostemm: harry=796 potter=373)
        5.8423506E-9 = queryNorm
      1.9802407 = fieldWeight(description_nostemm:"harry potter" in 36647), product of:
        1.0 = tf(phraseFreq=1.0)
        15.841926 = idf(description_nostemm: harry=796 potter=373)
        0.125 = fieldNorm(field=description_nostemm, doc=36647)
  0.81437016 = (MATCH) sum of:
    0.40718508 = (MATCH) weight(category_the:harri in 36647), product of:
      0.12338993 = queryWeight(category_the:harri), product of:
        10.559957 = idf(docFreq=38, maxDocs=553224)
        0.011684701 = queryNorm
      3.2999864 = (MATCH) fieldWeight(category_the:harri in 36647), product of:
        1.0 = tf(termFreq(category_the:harri)=1)
        10.559957 = idf(docFreq=38, maxDocs=553224)
        0.3125 = fieldNorm(field=category_the, doc=36647)
    0.40718508 = (MATCH) weight(category_the:Potter in 36647), product of:
      0.12338993 = queryWeight(category_the:Potter), product of:
        10.559957 = idf(docFreq=38, maxDocs=553224)
        0.011684701 = queryNorm
      3.2999864 = (MATCH) fieldWeight(category_the:Potter in 36647), product of:
        1.0 = tf(termFreq(category_the:Potter)=1)
        10.559957 = idf(docFreq=38, maxDocs=553224)
        0.3125 = fieldNorm(field=category_the, doc=36647)
  3.394099E-7 = (MATCH) FunctionQuery(pow(int(sold),const(1.5))), product of:
    58.09475 = pow(int(sold)=15,const(1.5))
    1.0 = boost
    5.8423506E-9 = queryNorm
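The "max plus 0.01 times others" nodes are DisMax clauses with a tie breaker of 0.01: the best field score counts in full, and every other matching field adds only 1% of its score. A minimal sketch of that rule, with a function name of my own choosing and the per-field scores copied from the explain output:

```python
def dismax_score(clause_scores, tie=0.01):
    """Best clause score plus `tie` times the sum of all the other clauses."""
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)


# Per-field scores for the first term (harry / harri), from the explain.
harry = dismax_score([4.154771e-7, 0.40718514, 5.976383e-8])

# Per-field scores for the second term (potter / Potter).
potter = dismax_score([5.0300997e-7, 0.40718514, 5.7398886e-8])

print(harry, potter)  # explain shows 0.40718514 for both
```

Because category_search carries a boost of 2000000, the two description fields add less than 1e-8 each, so they do not change the score at the precision Solr prints.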
Let’s visualize now
History view
Basic information
The real thing
Even more ☺
What if we can’t match?
And the no-match explain
What you gain from explain.solr.pl
• View Solr explain information in a human-readable form
• Easily recognize the most influential elements of the scoring process
• Answer the questions faster
• More things to come in the future
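To give a feel for what recognizing the most influential elements involves, here is a minimal sketch that reads explain text and ranks the top-level clauses by their share of the score. This is not the explain.solr.pl implementation; the function names are hypothetical, the sample is an abridged copy of the first example, and the parser assumes the two-spaces-per-level indentation of Solr's plain-text explain.

```python
import re

# One explain line: optional indentation, a number, " = ", then a description.
LINE = re.compile(r"^(?P<indent> *)(?P<value>[0-9.Ee+-]+) = (?P<label>.*)$")


def parse_explain(text, indent_width=2):
    """Return (depth, value, label) tuples, one per explain line."""
    rows = []
    for raw in text.splitlines():
        match = LINE.match(raw)
        if match:
            depth = len(match.group("indent")) // indent_width
            rows.append((depth, float(match.group("value")), match.group("label")))
    return rows


def top_level_clauses(rows):
    """Direct children of the root node, largest contribution first."""
    return sorted((r for r in rows if r[0] == 1), key=lambda r: r[1], reverse=True)


# Abridged sample: the children of the weight(...) node are left out.
SAMPLE = """\
1.6771803 = (MATCH) sum of:
  0.64883727 = (MATCH) max of:
    0.64883727 = (MATCH) weight(name:ddr^1000.0 in 6), product of:
  1.028343 = (MATCH) FunctionQuery(pow(float(price),const(1.5))), product of:
    2516.272 = pow(float(price)=185.0,const(1.5))
    1.0 = boost
    4.0867718E-4 = queryNorm
"""

rows = parse_explain(SAMPLE)
root_score = rows[0][1]
for depth, value, label in top_level_clauses(rows):
    print(f"{value / root_score:.1%}  {label}")
```

A share-based ranking like this only makes sense directly under a "sum of" node; under "max of" or "product of" nodes the children combine differently.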
Plans for the future
• Support for more formats of Apache Solr explain (right now, only Solr 3.x is supported)
• Visualisation of additional data
• More functionality, such as:
    ◦ query problem analysis
    ◦ query syntax analysis and explanation
    ◦ query time analysis and visualization
    ◦ result comparison between cores or instances
• Very distant future: an additional web application deployed alongside Solr to enable real-time analysis of the influence of boosts
Wrap Up
• http://explain.solr.pl should be available very soon (probably end of October or mid-November)
• The code of explain.solr.pl will be available on GitHub soon after the initial release
• There will be a Java version of http://explain.solr.pl, which will cover much more information
Sources
• Links
    ◦ http://www.solr.pl
    ◦ http://explain.solr.pl
    ◦ http://lucene.apache.org ☺
• We would like to thank:
    ◦ Łukasz Lewandowski ( http://llewandowski.pl/ ) for his work on the GUI
    ◦ Hubert ‘depesz’ Lubaczewski ( http://depesz.com ) for the idea ☺
Contact
• Rafał Kuć
    ◦ http://solr.pl
• Marek Rogoziński
    ◦ http://solr.pl
Thank you