OpenCms Days 2014 - Using the Solr Collector
![Page 1: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/1.jpg)
Sören Schneider, Alkacon Software
WORKSHOP TRACK
Using the SOLR Collector
27.11.2014
![Page 2: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/2.jpg)
1. Brief Introduction Into Solr
2. Common Mistakes Using OpenCms & Solr
3. Using the Solr Collector (DEMO)
4. Spellchecking in OpenCms Using Solr
Agenda
![Page 3: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/3.jpg)
● Solr is a very versatile and powerful search
engine that supports a wide range of features
● This functionality comes at the price of
increased complexity in handling Solr
● Many customizations available
● All fields composing a single document are typed
Brief Solr Introduction
![Page 4: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/4.jpg)
● The data structures of Solr's documents are
defined in the file schema.xml
● Changes to this file require reindexing
● Dynamic fields work around that limitation
● Their names are matched by wildcard patterns, so they can be
used without being explicitly defined in the schema
Defining Solr's Data Structure
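
To make the wildcard idea concrete, here is a minimal sketch of how a field name could be resolved to a type. The patterns `*_s`, `*_t` and `*_dt` and the helper `resolve_type` are assumptions for illustration only; a real schema.xml defines its own patterns, and picking the longest matching pattern is this sketch's choice rather than a statement about Solr's internals.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamic-field patterns (assumption, not taken from the slides).
DYNAMIC_FIELDS = {
    "*_s": "string",
    "*_t": "text",
    "*_dt": "date",
}


def resolve_type(field_name, explicit_fields=None):
    """Return the type for a field name, or None if nothing matches."""
    explicit_fields = explicit_fields or {}
    # Explicitly declared fields take precedence over wildcard patterns.
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    matches = [p for p in DYNAMIC_FIELDS if fnmatchcase(field_name, p)]
    if not matches:
        return None
    # Assumption of this sketch: the longest matching pattern wins.
    return DYNAMIC_FIELDS[max(matches, key=len)]


if __name__ == "__main__":
    for name in ("city_s", "created_dt", "teaser_t", "unknown"):
        print(name, "->", resolve_type(name))
```

A field such as `city_s` can therefore be indexed without touching schema.xml, as long as its suffix matches one of the declared patterns.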
![Page 5: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/5.jpg)
Solr: Indexing Content
[Diagram: fields a: date, b: text, c: string; Solr processing (through analyzers, filters and tokenizers); fields a: date, b: string, c: string]
![Page 6: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/6.jpg)
● "Direct" usage of OpenCms & Solr requires a
basic understanding of Solr
● Use data types that fit the individual
use case and learn how filters work
● Know the query syntax (for the data types in use)
● The most common mistakes of OpenCms users
result from insufficient knowledge of Solr basics
OpenCms & Solr
![Page 7: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/7.jpg)
1. Using improper types
● "text" vs "string"
● Formulating correct queries
2. Issues regarding the mapping OpenCms <-> Solr
3. (Encoding problems)
Common Mistakes Using Solr & OpenCms
![Page 8: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/8.jpg)
● String
● Stores its content as an exact string
● No tokenization or processing is performed
● Useful when searching for an exact value
● Text
● Tokenization and processing are performed
● Useful when searching for a part of the content
"text" vs "string"
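
The difference can be illustrated with a small sketch. The functions below are hypothetical stand-ins, not Solr code: `index_text` imitates an analyzer chain with a deliberately simple rule (lowercase, split on non-word characters), while `index_string` keeps the value untouched.

```python
import re


def index_string(value):
    """A string field keeps the whole value as one exact, unprocessed term."""
    return {value}


def index_text(value):
    """Simplified stand-in for an analyzer chain: lowercase and tokenize."""
    return set(re.findall(r"\w+", value.lower()))


def matches_string(terms, query):
    # Only the complete, identically written value matches.
    return query in terms


def matches_text(terms, query):
    # The query goes through the same normalization as the indexed text.
    return query.lower() in terms


TITLE = "OpenCms Days 2014"

if __name__ == "__main__":
    print(matches_string(index_string(TITLE), "OpenCms Days 2014"))
    print(matches_string(index_string(TITLE), "OpenCms"))
    print(matches_text(index_text(TITLE), "OpenCms"))
```

Searching a string field for a single word of a longer value finds nothing, which is the typical symptom of choosing the wrong type.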
![Page 9: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/9.jpg)
● OpenCms copies the entire XML content into
a single(!) locale-aware Solr field of type "text"
for each locale
● Particular information of a resource is made
searchable in OpenCms using two approaches
● Automatic mapping of properties to Solr fields
● Manual definition of mappings
Making Your Content Searchable
![Page 10: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/10.jpg)
Indexing Content w/o Searchsettings
[Diagram: Solr processing (through analyzers, filters and tokenizers); field labels x: text and a: date, b: string, c: string]
![Page 11: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/11.jpg)
Indexing Content with Searchsettings
[Diagram: fields a: date, b: text, c: string; Solr processing (through analyzers, filters and tokenizers); fields a: date, b: string, c: string]
![Page 12: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/12.jpg)
● Mapping happens in the schema (XSD) of the
appropriate resource type
● Excerpt
Solr – OpenCms Interaction: Mapping
<xsd:schema …>
  …
  <xsd:annotation>
    <xsd:appinfo>
      <searchsettings>
        <searchsetting element="City" searchcontent="true">
          <solrfield targetfield="city" sourcefield="_s" />
        </searchsetting>
        …
(Callout on element="City": resource type element name)
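
To review which elements are mapped to which Solr fields, the searchsettings block can be read with a few lines of code. This is a sketch under assumptions: `SEARCHSETTINGS` is a well-formed stand-in for the excerpt above with the surrounding xsd elements left out, and `list_mappings` is a hypothetical helper, not part of OpenCms.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the slide excerpt (xsd wrapper elements omitted).
SEARCHSETTINGS = """
<searchsettings>
  <searchsetting element="City" searchcontent="true">
    <solrfield targetfield="city" sourcefield="_s" />
  </searchsetting>
</searchsettings>
"""


def list_mappings(xml_text):
    """Collect one entry per <solrfield> with its owning element name."""
    root = ET.fromstring(xml_text)
    mappings = []
    for setting in root.iter("searchsetting"):
        for field in setting.iter("solrfield"):
            mappings.append({
                "element": setting.get("element"),
                "targetfield": field.get("targetfield"),
                "sourcefield": field.get("sourcefield"),
            })
    return mappings


MAPPINGS = list_mappings(SEARCHSETTINGS)

if __name__ == "__main__":
    for mapping in MAPPINGS:
        print(mapping)
```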
![Page 13: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/13.jpg)
Element Mapping Attributes
| Attribute Name | Effect on the Solr Field |
|---|---|
| targetfield* | The resulting name |
| locale | Write content only for specific locale |
| sourcefield | Defines the resulting type |
| copyfields | Copies the value to a different field |
| default | Sets a default value |
| boost | Sets a boost for the field |
![Page 14: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/14.jpg)
● Users complain about problems with
certain characters, mostly German umlauts,
in Solr results
● In nearly all cases the sole cause is that the
servlet container Solr is integrated into is
not configured for UTF-8
● Extra note for Tomcat users: Please check
whether you added the required attributes to
all appropriate <Connector> elements ;-)
Using UTF-8 in Solr
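
The typical symptom can be reproduced outside of Solr. The sketch below percent-encodes a query term as UTF-8, the way a browser would send it, and then decodes it once as UTF-8 and once as ISO-8859-1. That the misconfigured container decodes as ISO-8859-1 is an assumption for illustration; on Tomcat the attribute in question is presumably `URIEncoding` on the `<Connector>`, which the slides do not name explicitly.

```python
from urllib.parse import quote, unquote

original = "Müller"

# The client sends the term percent-encoded as UTF-8 bytes.
encoded = quote(original)

# A container configured for UTF-8 restores the original term.
correct = unquote(encoded, encoding="utf-8")

# A container assuming ISO-8859-1 turns each UTF-8 byte into its own character.
garbled = unquote(encoded, encoding="iso-8859-1")

if __name__ == "__main__":
    print(encoded)
    print(correct)
    print(garbled)
```

If search terms or results show two odd characters in place of one umlaut, the decoding step is a likely place to look.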
![Page 15: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/15.jpg)
● Live Demo
Live Demo
Demo
Demo Demo
Demo
デモ
![Page 16: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/16.jpg)
WYSIWYG Spellchecker
![Page 17: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/17.jpg)
● The spellchecker is implemented using Solr
● Solr already provides a flexible component named
"SpellCheckComponent"
● This component supports inline spellchecking of
Solr queries
● The source for suggestions can be specified as Solr
fields or text files
WYSIWYG Spellchecker
![Page 18: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/18.jpg)
● The "SpellCheckComponent" is widely used to
implement the "Did you mean?" feature known
from popular search engines
● The component is
● Reliable and mature
● Fast
● Plus, Solr is already available in OpenCms
Why Use Solr as a Spellchecker?
![Page 19: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/19.jpg)
● If both use cases use the same component,
how do the implementations actually differ?
● "Did you mean?" builds its source of suggested words
from the entire data the search runs on.
Usually only a single hit is returned.
● The WYSIWYG spellchecker builds its source of
suggestions from data that contains only
the dictionary for a single language
Implementation Differences Between the
Use Cases
![Page 20: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/20.jpg)
● Spellchecking is implemented using another Solr
core that resides in WEB-INF/spellcheck
● As the only purpose of this core is to contain spellcheck
information, the schema.xml file is as simple as it gets
● Why use another Solr core instead of the default core
that's used by OpenCms?
● Dictionaries are stored as one Solr index per
language
How to Model This Scenario Using Solr?
![Page 21: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/21.jpg)
● Sadly, the spellchecking interfaces of tinyMCE
and Solr are incompatible
Problems Regarding tinyMCE and Solr
Solr
tinyMCE
![Page 22: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/22.jpg)
Comparison Spellcheck Responses
tinyMCE format:
{
  "id": "c0",
  "result": {"hsoue": ["house", "has"]}
}
Solr format:
"spellcheck": {"suggestions": [
  "hsoue", {"numFound": 5,
    "startOffset": 0, "endOffset": 4,
    "origFreq": 0,
    "suggestion": [{"word": "house", "freq": 53},
      {"word": "has", "freq": 271},
      …
    ]},
  "correctlySpelled", false,
  "collation", "hsue"
]},
![Page 23: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/23.jpg)
● A new component had to be implemented in
OpenCms that basically
● Accepts spellcheck requests from tinyMCE
● Handles the communication between tinyMCE and Solr,
including message conversion
● Checks and (re-)builds spellcheck indices
● The corresponding code is found in
org.opencms.search.solr.spellcheck
Gluing the Pieces Together
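
The message conversion step can be sketched as follows. This is not the code from org.opencms.search.solr.spellcheck (which is Java); it is a minimal illustration that assumes the two response shapes shown on the comparison slide. The function name `solr_to_tinymce` is hypothetical, and the sample response is shortened to the two suggestions visible on the slide.

```python
def solr_to_tinymce(solr_response, request_id):
    """Convert a Solr spellcheck response into the tinyMCE result shape."""
    flat = solr_response["spellcheck"]["suggestions"]
    result = {}
    # Solr returns a flat list that alternates between a key and its value.
    for key, value in zip(flat[0::2], flat[1::2]):
        # Only entries for misspelled words carry a "suggestion" list;
        # "correctlySpelled" and "collation" are skipped.
        if isinstance(value, dict) and "suggestion" in value:
            result[key] = [entry["word"] for entry in value["suggestion"]]
    return {"id": request_id, "result": result}


# Shortened sample based on the comparison slide (assumption: two suggestions).
SOLR_RESPONSE = {
    "spellcheck": {"suggestions": [
        "hsoue", {
            "numFound": 2, "startOffset": 0, "endOffset": 4, "origFreq": 0,
            "suggestion": [
                {"word": "house", "freq": 53},
                {"word": "has", "freq": 271},
            ],
        },
        "correctlySpelled", False,
        "collation", "hsue",
    ]}
}

if __name__ == "__main__":
    print(solr_to_tinymce(SOLR_RESPONSE, "c0"))
```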
![Page 24: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/24.jpg)
● Dictionaries can be edited easily in OpenCms
● The indices are filled automatically from flat text
files, one word per line
● Support for multiple languages
● To access the dictionaries, have a look at the directory
org.opencms.workplace.spellcheck/resources/
Spellchecker in OpenCms
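
The dictionary format is simple enough to sketch the idea of a dictionary-based spellchecker in a few lines. The example below reads a one-word-per-line text and looks up suggestions with `difflib` from the Python standard library, which serves here only as a stand-in for Solr's SpellCheckComponent. The helper names and the sample words are assumptions.

```python
import difflib


def load_dictionary(text):
    """Parse a flat word list: one word per line, blank lines ignored."""
    return sorted({line.strip() for line in text.splitlines() if line.strip()})


def suggest(word, dictionary, limit=5):
    """Return up to `limit` similar dictionary words, best match first."""
    return difflib.get_close_matches(word, dictionary, n=limit, cutoff=0.6)


# Hypothetical dictionary content for a single language.
WORDS = load_dictionary("house\nhas\n\nhorse\nmouse\n")
SUGGESTIONS = suggest("hsoue", WORDS)

if __name__ == "__main__":
    print(WORDS)
    print(SUGGESTIONS)
```

Because each language has its own word list, suggestions never mix words from different languages.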
![Page 25: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/25.jpg)
● Adding a new language
1. Create new Solr field in schema.xml
2. Create new dictionary file inside VFS
3. Restart OpenCms
● Adding words to the custom dictionary
Extending the Spellchecker
![Page 26: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/26.jpg)
● Any Questions?
Any Questions?
Fragen? Questions ?
Questiones?
¿Preguntas? 質問
![Page 27: OpenCms Days 2014 - Using the SOLR collector](https://reader033.vdocuments.mx/reader033/viewer/2022042716/55a2b5441a28ab040d8b45e4/html5/thumbnails/27.jpg)
Sören Schneider
Alkacon Software GmbH
http://www.alkacon.com
http://www.opencms.org
Thank you very much for your
attention!