University of Cyprus
Department of Computer Science
EPL660: Information Retrieval and Search Engines – Lab 3
Παύλος Αντωνίου
Office: B109, ΘΕΕ01
Apache Solr
• Popular, fast, open-source search platform from the Apache Lucene project
• Written in Java; runs as a standalone full-text search server, either on a single node or distributed (SolrCloud)
• Uses the Lucene Java search library at its core for full-text indexing and search
Apache Solr Features
• XML/HTTP and JSON APIs
• Hit highlighting
• Faceted Search and Filtering
• Near real-time indexing
• Database integration
• Rich document (e.g., Word, PDF) handling
• Geospatial Search
• Fast Incremental Updates and Index Replication
• Caching
• Replication
• Web administration interface, etc.
Apache Solr vs Apache Lucene
• The relationship between Solr and Lucene is that of a car and its engine: you can't drive an engine, but you can drive a car.
• Lucene is a library which you can't use as-is, whereas Solr is a complete application which you can use out of the box.
• Unlike Lucene, Solr is a web application (WAR) which can be deployed in any servlet container, e.g. Jetty, Tomcat, Resin, etc.
– a single JAR file is needed to deploy the application on a server
• Solr can be installed and used easily by non-programmers; Lucene requires programming skills.
When to use Lucene?
• You need to embed search functionality into, for example, a desktop application
• You have very customized requirements requiring low-level access to the Lucene API classes
– Solr may be more a hindrance than a help, since it is an extra layer of indirection
SolrCloud
• Apache Solr includes the ability to set up a cluster
of Solr servers that combines fault tolerance and
high availability: SolrCloud
• SolrCloud allows for distributed search and
indexing
• SolrCloud features:
– Central configuration for the entire cluster
– Automatic load balancing and fail-over for queries
– ZooKeeper integration for cluster coordination and
configuration
SolrCloud Concepts
• A Cluster is made up of one or more Solr Nodes,
which are running instances of the Solr server
process
SolrCloud Concepts
• A Cluster can host multiple Collections of Solr
Documents
• A collection can be partitioned into multiple
Shards (pieces), which contain a subset of the
Documents in the Collection
• Each Shard can be replicated (Leader & Replicas)
SolrCloud Concepts
• The number of Shards that a Collection has
determines:
– The theoretical limit to the number of Documents that the Collection can reasonably contain.
– The amount of parallelization that is possible for an
individual search request.
• The number of Replicas that each Shard has
determines:
– The level of redundancy built into the Collection and
how fault tolerant the Cluster can be in the event that
some Nodes become unavailable.
– The theoretical limit on the number of concurrent search requests that can be processed under heavy load.
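• The shard and replica counts are specified when a collection is created. As an illustrative sketch (the collection name and counts here are hypothetical), this is done through the Collections API:
# hypothetical collection: 2 shards x 2 replicas = 4 cores spread across the cluster
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2"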
Getting Started
• Download Apache Solr from http://www.eu.apache.org/dist/lucene/solr/7.2.0/solr-7.2.0.tgz (or the zip for Windows)
• Extract the archive and go to the solr directory
• Open a terminal and type:
bin/solr start -e cloud -noprompt
• This will start up a SolrCloud cluster with embedded ZooKeeper (cloud management service) on your local workstation with 2 nodes
– First node listens on port 8983, second on port 7574
• You can see that Solr is running by loading http://localhost:8983/solr/ in your web browser.
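• To confirm that both nodes came up, you can also check from the terminal (a quick sanity check; the exact output varies by version):
# reports the running Solr nodes and their ports
bin/solr status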
Solr web interface
SolrCloud
• Preview collections on the Collections tab
– One collection created automatically: gettingstarted
– Collection is partitioned into 2 shards
• First node stores the 2 leader shards, the second stores 2 replicas
• The Solr server is up and running, with one collection but no data indexed
• Important configuration files: solrconfig.xml, managed-schema
– solr-dir/server/solr/configsets/_default/conf/solrconfig.xml
– solr-dir/server/solr/configsets/_default/conf/managed-schema
How Solr Sees the World
• Document: basic unit of information
– set of data that describes something
• E.g. a document about a person might contain the person's name, biography, favorite color, and shoe size
– documents are expected to be composed of fields,
which are more specific pieces of information
• E.g. "first_name":"Pavlos", "shoe_size":42
– fields can contain different types of data
• first_name text, shoe_size number
• User defines type of each field
• Field type tells Solr how to interpret the field and how it can be
queried
– When a document is added to a collection, Solr takes the values from the document's fields and adds them to the index
– Queries consult the index and return matching documents
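• As a minimal sketch of such a document (the field names first_name and shoe_size are illustrative, not part of any shipped schema), a person could be added to the gettingstarted collection as JSON:
# POST a JSON array of documents to the collection's /update handler
curl -X POST -H "Content-Type: application/json" "http://localhost:8983/solr/gettingstarted/update?commit=true" --data-binary '[{"id":"person-1","first_name":"Pavlos","shoe_size":42}]'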
Field Analysis Process
• How does Solr process document fields when
building an index?
– Example: biography field in a person document
"biography": "He received his Ph.D. from Department of Computer Science of the University of Cyprus, in 2012"
– Index every word of the biography in order to quickly find people whose lives have had anything to do with "university" or "computer". Any issues?
• What if biography contains a lot of common words you don’t
really care about like "he", "the", "a", "to", "for", "is" (stop
words)?
• What if biography contains the word "University" and a user
makes a query for "university"?
• Solution: field analysis
Field Analysis Process
• For each field, you can tell Solr:
– how to break apart the text into words (tokenization)
• E.g. split at whitespaces, commas, etc.
– to remove stop words (filtering)
– to lower-case the text (normalization)
– to remove accent marks
Read more here: Understanding
Analyzers, Tokenizers, and Filters
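• For instance, a field type combining these steps might be declared in the schema roughly as follows (a sketch, not the exact configuration shipped with Solr):
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <!-- break the text into tokens at word boundaries -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- drop common stop words listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <!-- lower-case all tokens so "University" matches "university" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>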
Schema files and manipulation
• Solr stores details about the field types and fields
it is expected to understand in a schema file:
– managed-schema is the name of the schema file Solr uses by default; it supports making schema changes at runtime via the Schema API (over HTTP) or via Schemaless Mode, so hand-editing of the managed schema file is avoided (see the example after this slide)
– schema.xml is the traditional name for a schema file
which can be edited manually by users who use the
ClassicIndexSchemaFactory
– If you are using SolrCloud you may not be able to find
any file by these names on the local filesystem. You will
only be able to see the schema through the Schema
API (if enabled) or through the Solr Admin UI’s Cloud
Screens
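• For example, a field can be added at runtime through the Schema API over HTTP (the biography field here is hypothetical):
# add a stored text_en field named biography to the gettingstarted collection
curl -X POST -H "Content-type: application/json" "http://localhost:8983/solr/gettingstarted/schema" --data-binary '{"add-field":{"name":"biography","type":"text_en","stored":true}}'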
Field Analysis
• Schema defines
– The kind of fields available for indexing
– The type of analysis to be applied when indexing or
querying each field
– Available field types such as float, long, double, date,
text
• Explore the schema using the Schema tab (see next slide)
– Example: choose the "*_txt" field to see how Solr treats field names ending in _txt
Field Analysis
Schema tab
Indexed fields are fields which pass through the analysis phase and are added to the index so as to be searchable/sortable by queries.
Stored fields are fields whose original text is stored in the index so as to be retrievable by queries.
Field Analysis
• Go to the Analysis tab (see next slide) to see how a text value is broken down into words by index-time and query-time analysis
– Field Value (Index): He received his Ph.D. from
Department of Computer Science of the
University of Cyprus, in 2012
– Analyse Fieldname / FieldType: text_en
Field Analysis
Insert text to Analyze
Analysis tab
Field Analysis
The word "of" has been "stopped"
Indexing XML Data
• Solr includes a simple command line tool for POSTing various types of content to a Solr server
– bin/post on UNIX; different usage on Windows
• Let's first index two XML files
– UNIX: remain in the solr directory
• bin/post -c gettingstarted example/exampledocs/solr.xml example/exampledocs/monitor.xml
– Windows: go to the example/exampledocs dir
• java -Dc=gettingstarted -jar post.jar solr.xml monitor.xml
• You have now indexed two documents in Solr
• Browse the indexed documents at
– http://localhost:8983/solr/gettingstarted/browse
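• As a quick sanity check (not part of the official steps), a match-all query should now report two documents:
# *:* matches every document; "numFound" should be 2 at this point
curl "http://localhost:8983/solr/gettingstarted/select?q=*:*"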
Collection browsing
Collection querying
Querying Data via Solr Admin UI
• Solr can be queried via REST clients, curl, wget,
Chrome POSTMAN, etc., as well as via native
clients available for many programming languages.
• Solr Admin UI includes a query builder interface
– In Admin interface choose gettingstarted collection
– In "Query" tab click button to display results
RequestHandlers are specified in solrconfig.xml
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
</lst>
</requestHandler>
<initParams
path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
<lst name="defaults">
<str name="df">_text_</str>
</lst>
</initParams>
Search for anything
Default search field: _text_
Querying Data via Solr Admin UI
– Enter "solr" in the "q" text box, to search for "solr" in
the index
• Why are no results returned?
– The default field for searching the word solr is _text_; no document contains solr in that field
– Change df to name and press the button again
– Results can also be previewed in the browser:
http://localhost:8983/solr/gettingstarted/select?q=solr&df=name (response in JSON format)
http://localhost:8983/solr/gettingstarted/select?q=solr&df=name&wt=xml (response in XML format)
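• The same request can also be issued from the command line; curl's -G and --data-urlencode flags take care of escaping the parameters (a sketch):
# -G turns the URL-encoded data into GET query parameters
curl -G "http://localhost:8983/solr/gettingstarted/select" --data-urlencode "q=solr" --data-urlencode "df=name"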
Querying Data via Solr Admin UI
RESTful URL to query Solr; it can be used when querying Solr from custom apps.
Querying Data
• Index all .xml documents in example/exampledocs
– UNIX: bin/post -c gettingstarted example/exampledocs/*.xml
– Windows: java -Dc=gettingstarted -jar post.jar *.xml
• ...and now you can search for all sorts of things
using the default Solr Query Syntax (a superset of
the Lucene query syntax)...
– video
– name:*Video*
– address_s:*ist*
– +video +price:[* TO 400]
• docs having video in searchable fields and price up to 400
– -address_s:*
• docs that do not have address_s field
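• When sending such queries over HTTP yourself, characters like +, [ and spaces must be URL-encoded; for example, the +video +price:[* TO 400] query becomes (a sketch):
curl "http://localhost:8983/solr/gettingstarted/select?q=%2Bvideo%20%2Bprice%3A%5B*%20TO%20400%5D"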
Updating Data
• Although solr.xml has been POSTed to the server twice, searching for "q": solr returns only one match:
{ "numFound": 1, "start": 0, …
"docs": [ { "id": "SOLR1000", … } ] }
– Why?
• This is because the example schema.xml
specifies a "uniqueKey" field called "id".
• Whenever you POST commands to Solr to add a
document with the same value for the uniqueKey
as an existing document, it automatically replaces
it for you.
Updating Data
• You can see that that has happened by looking at
the values for numDocs and maxDoc in the
"CORE"/searcher section of the statistics page...
• http://localhost:8983/solr/index.html#/gettingstarted/plugins?entry=searcher&type=core
Deleting Data
• You can delete data by POSTing a delete
command to the update URL and specifying the
value of the document's unique key field, or a query
that matches multiple documents
java -Dc=gettingstarted -Ddata=args -Dcommit=false -jar post.jar "<delete><id>SP2514N</id></delete>"
• Delete documents that match a specific query:
java -Dc=gettingstarted -Dcommit=false -Ddata=args -jar post.jar "<delete><query>name:*DDR*</query></delete>"
Querying Data via REST API
• Searches are done via HTTP GET on the select
URL with the query string in the q parameter.
• You can pass a number of optional request
parameters to the request handler to control what
information is returned.
– use the "fl" parameter to control what stored fields are
returned, and if the relevancy score is returned:
• q=video&fl=name,id (return only name and id fields)
• q=video&fl=name,id,score (return relevancy score as well)
• q=video&fl=*,score (return all stored fields, as well as relevancy
score)
• q=video&sort=address_s desc&fl=name,id,price (add sort
specification: sort by address_s descending)
• q=video&wt=json (return response in JSON format)
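• Putting several of these parameters together in a single request might look like this (a sketch):
# return name, price and relevancy score for "video" matches, as JSON
curl "http://localhost:8983/solr/gettingstarted/select?q=video&fl=name,price,score&wt=json"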
Sorting
• Solr provides a simple method to sort on one or more indexed fields. Use the "sort" parameter to specify "field direction" pairs, separated by commas if there's more than one sort field:
– q=video&sort=price desc
– q=video&sort=price asc
– q=video&sort=inStock asc, price desc
• "score" can also be used as a field name when specifying
a sort:
– q=video&sort=score desc
– q=video&sort=inStock asc, score desc
• Complex functions may also be used to sort results:
– q=video&sort=div(popularity,add(price,1)) desc
• If no sort is specified, the default is score desc to return
the matches having the highest relevancy
Indexing “Rich” Data
• Index local "rich" files including HTML, PDF, Microsoft Office formats (such as MS Word), plain text and many other formats found in the docs/ directory
– UNIX: bin/post -c gettingstarted docs/
Index Data
• There are many other different ways to import
your data into Solr... one can:
– Import records from a database using the Data Import
Handler (DIH)
• see tutorial here for MySQL or SQL Server database import
– Load a CSV file (comma separated values), including those exported by Excel or MySQL (see the example below).
– POST JSON documents
– Index binary documents such as Word and PDF
with Solr Cell (ExtractingRequestHandler).
– Use SolrJ for Java or other Solr clients to programmatically create documents to send to Solr.
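• As a concrete sketch of the CSV route (books.csv is a hypothetical file whose header row names the fields):
# the update handler maps CSV columns to fields using the header row
curl "http://localhost:8983/solr/gettingstarted/update?commit=true" --data-binary @books.csv -H "Content-type: application/csv"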
Stopping SolrCloud
• Stop SolrCloud nodes
– bin/solr stop -all
• Delete Solr home for nodes (if needed):
– rm -rf example/cloud/node1
– rm -rf example/cloud/node2
Useful Links
• http://lucene.apache.org/solr/index.html
• http://lucene.apache.org/solr/quickstart.html
• http://wiki.apache.org/solr/SolrResources
– Next Week: Elasticsearch