Webinar: Solr's example/files: from bin/post to /browse and beyond
TRANSCRIPT
$ bin/post -c your_collection your_data/
https://lucidworks.com/blog/2015/08/04/solr-5-new-binpost-utility/
https://lucidworks.com/blog/2015/12/08/browse-new-improved-solr-5/
http://localhost:8983/solr/<collection>/browse
example/files
• Distilled, simple, document type navigation
• Multi-lingual, localizable interface
• Language detection and faceting
• Phrase/shingle indexing and "tag cloud" faceting
• E-mail address and URL index-time extraction
• "instant search" (as you type results)
quick start
$ bin/solr start
$ bin/solr create -c files -d example/files
$ bin/post -c files ~/Documents
$ open http://localhost:8983/solr/files/browse
URLs are UI too!
• /browse is stateless
• all parameters for the view must be passed on the URL
• and/or in config:
  • request handler definition (core reload API)
  • paramsets / params.json (real-time API, no reload)
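As a rough illustration of the paramsets bullet, the sketch below builds the JSON body that Solr's Request Parameters API accepts at `/config/params`. The collection name `files` comes from the quick start; the paramset name `browse_defaults` and the values in it are hypothetical, chosen only for illustration. Nothing is sent unless `post_paramset` is called against a running Solr.

```python
import json
import urllib.request

SOLR = "http://localhost:8983/solr"  # quick-start default host and port


def paramset_command(name, params):
    """Build a 'set' command body for the Request Parameters API."""
    return json.dumps({"set": {name: params}}).encode("utf-8")


def post_paramset(collection, body):
    """POST the command to /config/params (no core reload needed)."""
    req = urllib.request.Request(
        f"{SOLR}/{collection}/config/params",
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical paramset name and values, for illustration only.
body = paramset_command(
    "browse_defaults", {"facet": "on", "facet.field": "doc_type"}
)
# post_paramset("files", body)  # uncomment with Solr running
```

A request handler definition would need a core reload to pick up the same change, which is the trade-off the two config bullets point at.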
URL comparison: /browse, type, locale
• /browse?type=html&locale=de_DE
• /select?v.locale=de_DE&fq={!field%20f=doc_type%20v=html}&wt=velocity&v.template=browse&v.layout=layout&q=*:*&facet.query={!ex=type%20key=all_types}*:*&facet=on&facet.field={!ex=type}doc_type…
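To make the comparison concrete, here is a small sketch that assembles both URLs from the same two inputs. It assumes the `files` collection on the quick-start host; the function names are made up for this example, and the `/select` version includes only the parameters visible on the slide (the slide's URL is truncated, and the remainder is not reproduced).

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/files"  # assumption: quick-start collection


def browse_url(doc_type, locale):
    """Short form: /browse carries only the view state."""
    return f"{BASE}/browse?" + urlencode({"type": doc_type, "locale": locale})


def select_url(doc_type, locale):
    """Long form: the same view spelled out against /select."""
    params = [
        ("v.locale", locale),
        ("fq", f"{{!field f=doc_type v={doc_type}}}"),
        ("wt", "velocity"),
        ("v.template", "browse"),
        ("v.layout", "layout"),
        ("q", "*:*"),
        ("facet.query", "{!ex=type key=all_types}*:*"),
        ("facet", "on"),
        ("facet.field", "{!ex=type}doc_type"),
        # The slide's URL continues past this point; the rest is omitted here.
    ]
    return f"{BASE}/select?" + urlencode(params)


print(browse_url("html", "de_DE"))
print(select_url("html", "de_DE"))
```

The short URL works because everything else lives in the handler configuration, so only `type` and `locale` need to travel with the link.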
Solr Tips and Tricks within
• Indexing “pipeline”
  • language detection
  • document type identification
  • e-mail address and URL extraction
  • Top Phrases
• Query “pipeline”
  • document type faceting and filtering
  • UI localisation/localization
implementation: E-mail address and URL extraction
conf/email_url_types.txt
<URL>
<EMAIL>
/select?fl=id,email_ss,url_ss&wt=csv
conf/managed-schema:
conf/update-script.js:
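The contents of these two config files are not reproduced in this transcript. As a rough stand-in for the idea, the sketch below pulls e-mail addresses and URLs out of a document's text and stores them under the `email_ss` and `url_ss` field names used in the `/select` example. It deliberately swaps in plain regular expressions; the `<URL>` and `<EMAIL>` token types suggest the real example relies on Solr's analysis chain, which this does not replicate. The function name and sample text are made up for illustration.

```python
import re

# Simplified patterns, for illustration only; they are much looser than
# what a tokenizer-based approach would recognise.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_fields(content):
    """Return de-duplicated e-mail addresses and URLs found in content."""
    emails = sorted(set(EMAIL_RE.findall(content)))
    # Trim punctuation that commonly trails a URL at the end of a sentence.
    urls = sorted({u.rstrip(".,;)") for u in URL_RE.findall(content)})
    return {"email_ss": emails, "url_ss": urls}


doc = extract_fields("Mail info@example.com or see https://example.com/docs.")
print(doc)
```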
implementation: document type faceting and filtering
&type=[doc|pdf|image|…|all|unknown]
fq={!switch v=$type tag=type
  case='*:*'
  case.all='*:*'
  case.unknown='-doc_type:[* TO *]'
  default=$type_fq}
type_fq={!field f=doc_type v=$type}
facet.field={!ex=type}doc_type
f.doc_type.facet.mincount=0
f.doc_type.facet.missing=true
facet.query={!ex=type key=all_types}*:*
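The `{!switch}` logic above can be read as a small lookup. The sketch below mimics it outside Solr, purely to show which filter each `type` value ends up producing; it is an approximation of the switch parser's behaviour under the parameters shown, not Solr code, and the function name is made up for this example.

```python
def resolve_type_fq(type_param):
    """Approximate the filter query the {!switch} parser selects for `type`."""
    value = (type_param or "").strip()
    cases = {
        "": "*:*",                        # case: type missing or blank
        "all": "*:*",                     # case.all
        "unknown": "-doc_type:[* TO *]",  # case.unknown: no doc_type value
    }
    if value in cases:
        return cases[value]
    # default=$type_fq, which is {!field f=doc_type v=$type}
    return f"{{!field f=doc_type v={value}}}"


for t in (None, "all", "unknown", "pdf"):
    print(repr(t), "->", resolve_type_fq(t))
```

Because the filter is tagged `type` and the facet parameters exclude that tag with `{!ex=type}`, the doc_type counts stay visible for every type even while one is selected.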
example/files: what’s next?
• Fix e-mail and URL field names (<email>_ss and <url>_ss, with angle brackets in the field names); also add display of these fields to the /browse results rendering
• Improve quality of extracted phrases
• Extract, facet, and display acronyms
• Add sorting controls, possibly for some or all of these: last modified date, created date, relevancy, and title
• Perhaps add grouping by doc_type
• Fix debug mode: it currently does not update the parsed query debug output (this is probably a bug in the data-driven /browse as well)
• Harden update-script: it currently errors if documents do not have a "content" field
• Filter out bogus e-mail addresses
https://issues.apache.org/jira/browse/SOLR-8590
And beyond…
• Leveraging https://github.com/LucidWorks/fusion-solr-plugins
• Analytics
• Relevancy Tuning: signals feedback, parameter adjustments
• Landing pages, scripting, etc.