Posted on 18-Jul-2015
Apache Solr Technical Document
Contents
Requirements
Solution - Solr
    Features
    Typical Solr Setup Diagram
    Basic Solr Concepts
        1. Indexing
        2. How Solr represents data
    Installing Solr
    Starting Solr
    Indexing Data
    Searching
        Faceting
        Highlighting
        Spell Checking
        Relevance
    Shutdown
    Screen Shots
Apache SolrCloud
    Features
    Simple two shard cluster
    Dealing with high volume of data
    Dealing with failure
    Synchronization of data (added/updated in DB) with Solr
    Limitations
    Screen Shots
    Integration with .Net using SolrNet
Requirements
a. Fast, full-text search capabilities
b. Optimized for high volumes of data and web traffic
c. Highly and linearly scalable on demand
d. Pluggable into any platform
e. Near real-time search and indexing
f. Flexible and adaptable, with XML, JSON and CSV configuration
Solution - Solr
Solr is a standalone enterprise search server with a REST-like API. You put documents in it
(called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and
receive XML, JSON, CSV or binary results.
Features
Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces - XML, JSON and HTTP
Comprehensive HTML Administration Interfaces
Linearly scalable, auto index replication, auto failover and recovery
Near Real-time indexing
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture
Easily manage multilingual support
Typical Solr Setup Diagram
Figure 1 Typical Solr Setup Diagram
Basic Solr Concepts
In this document, we'll cover the basics of what you need to know about Solr in order to use it.
1. Indexing
Solr is able to achieve fast search responses because, instead of searching the text directly, it
searches an index.
This is like retrieving pages in a book related to a keyword by scanning the index at the back of
a book, as opposed to searching every word of every page of the book.
This type of index is called an inverted index, because it inverts a page-centric data structure
(page->words) to a keyword-centric data structure (word->pages).
Solr stores this index in a directory called index in the data directory.
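The inversion described above can be sketched in a few lines of plain Python. This is a toy illustration of the data structure only, not how Lucene/Solr actually stores its index on disk:

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Build a toy inverted index mapping each word to the pages containing it."""
    index = defaultdict(set)
    for page_num, text in enumerate(pages, start=1):
        for word in text.lower().split():
            index[word].add(page_num)
    return index

# Page-centric data: page -> words
pages = [
    "Solr is a search server",
    "Lucene powers the Solr search index",
]

# Keyword-centric result: word -> pages
index = build_inverted_index(pages)
print(sorted(index["search"]))  # pages containing "search" -> [1, 2]
print(sorted(index["lucene"]))  # -> [2]
```

Looking up a keyword is now a single dictionary access, instead of a scan over every page.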
2. How Solr represents data
In Solr, a Document is the unit of search and indexing.
An index consists of one or more Documents, and a Document consists of one or more Fields.
Schema
Before adding documents to Solr, you need to specify the schema, represented in a file
called schema.xml. It is not advisable to change the schema after documents have been added
to the index.
The schema declares:
o what kinds of fields there are
o which field should be used as the unique/primary key
o which fields are required
o how to index and search each field
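The declarations above can be sketched as a minimal schema.xml. This fragment is illustrative only; the field names are made up, and the type names (string, text_general) follow the stock Solr 4.x example schema:

```xml
<schema name="example" version="1.5">
  <fields>
    <!-- unique/primary key: required and single-valued -->
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <!-- full-text field: analyzed, searchable, and returned in results -->
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <!-- multi-valued field: a document may belong to several categories -->
    <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
```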
Field Types
In Solr, every field has a type.
Examples of basic field types available in Solr include:
o float
o long
o double
o date
o text
Defining a field
Here's what a field declaration looks like:
<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
o name: the name of the field
o type: the field type
o indexed: whether this field should be added to the inverted index
o stored: whether the original value of this field should be stored
o multiValued: whether this field may have multiple values
The indexed and stored attributes are the most important, and are explained in the next two sections.
Analysis
When data is added to Solr, it goes through a series of transformations before being added to
the index. This is called the analysis phase. Examples of transformations include lower-casing,
removing word stems etc. The end result of the analysis is a series of tokens which are then
added to the index. Tokens, not the original text, are what are searched when you perform a
search query.
Indexed fields are fields which undergo an analysis phase, and are added to the index.
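A toy sketch of such an analysis chain follows. It assumes nothing about Solr's real analyzers, which are configured per field type in schema.xml; the suffix-stripping step is only a crude stand-in for real stemming:

```python
import re

def analyze(text):
    """Toy analysis chain: tokenize, lower-case, and strip a plural 's'
    as a crude stand-in for real stemming."""
    tokens = re.findall(r"[a-zA-Z0-9]+", text)           # tokenizer
    tokens = [t.lower() for t in tokens]                 # lower-case filter
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t
              for t in tokens]                           # crude "stemming"
    return tokens

print(analyze("Running Dogs bark!"))  # -> ['running', 'dog', 'bark']
```

The returned tokens, not the original string, are what would be added to the index.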
Term Storage
When displaying search results to users, they generally expect to see the original document,
not the machine-processed tokens.
That's the purpose of the stored attribute: it tells Solr to store the original text in the index
as well.
Sometimes there are fields which aren't searched, but which need to be displayed in the search results.
You accomplish that by setting the field attributes to stored="true" and indexed="false".
So, why wouldn't you store all the fields all the time?
Because storing fields increases the size of the index, and the larger the index, the slower the
search. In physical terms, a larger index requires more disk seeks to retrieve the same amount
of data.
Installing Solr
You should have JDK 6 or above installed.
Begin by unzipping the Solr release and changing your working directory to the "example"
directory:
unzip -q apache-solr-4.1.0.zip
cd apache-solr-4.1.0/example/
Starting Solr
Solr comes with an example directory which contains some sample files we can use.
We start this example server with java -jar start.jar.
cd example
java -jar start.jar
You should see something like this in the terminal.
2011-10-02 05:20:27.120:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
2011-10-02 05:20:27.212:INFO::jetty-6.1-SNAPSHOT
....
2011-10-02 05:18:27.645:INFO::Started SocketConnector@0.0.0.0:8983
Solr is now running! You can now access the Solr Admin webapp by loading
http://localhost:8983/solr/admin/ in your web browser.
Indexing Data
We're now going to add some sample data to our Solr instance.
The exampledocs folder contains some XML files that we can post from the command line:
cd exampledocs
java -jar post.jar solr.xml monitor.xml
That produces:
SimplePostTool: POSTing files to http://localhost:8983/solr/update.
SimplePostTool: POSTing file solr.xml
SimplePostTool: POSTing file monitor.xml
SimplePostTool: COMMITting Solr index changes.
This response tells us that the POST operation was successful.
You can also index all of the sample data, using the following command (assuming your
command line shell supports the *.xml notation):
cd exampledocs
java -jar post.jar *.xml
Searching
Let's see if we can retrieve the documents we just added by loading the URL below in a browser.
Since Solr accepts HTTP requests, you can use your web browser to communicate with
Solr: http://localhost:8983/solr/select?q=*:*&wt=json
This returns the following JSON result:
{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "wt": "json",
      "q": "*:*"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "3007WFP",
        "name": "Dell Widescreen UltraSharp 3007WFP",
        "manu": "Dell, Inc.",
        "includes": "USB cable",
        "weight": 401.6,
        "price": 2199,
        "popularity": 6,
        "inStock": true,
        "store": "43.17614,-90.57341",
        "cat": [
          "electronics",
          "monitor"
        ],
        "features": [
          "30\" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast"
        ]
      }
    ]
  }
}
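A client would normally parse this JSON programmatically. A minimal sketch using only the standard library; the response body here is abbreviated to the fields the sketch uses:

```python
import json

# Abbreviated form of the response shown above
body = '''
{
  "responseHeader": {"status": 0, "QTime": 0},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {"id": "3007WFP", "name": "Dell Widescreen UltraSharp 3007WFP", "price": 2199}
    ]
  }
}
'''

data = json.loads(body)
assert data["responseHeader"]["status"] == 0   # status 0 means the query succeeded
docs = data["response"]["docs"]
for doc in docs:
    print(doc["id"], "-", doc["name"])
```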
Faceting
Faceting is the arrangement of search results into categories based on indexed terms. Searchers
are presented with the indexed terms, along with numerical counts of how many matching
documents were found for each term. Faceting makes it easy for users to explore search
results, narrowing in on exactly the results they are looking for.
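Conceptually, facet counts are just per-term document counts over the result set; Solr computes them server-side when you pass facet=true&facet.field=cat on a query. A toy sketch of the counting itself, over made-up documents:

```python
from collections import Counter

# Matching documents, each with a multi-valued "cat" field
docs = [
    {"id": "1", "cat": ["electronics", "monitor"]},
    {"id": "2", "cat": ["electronics", "memory"]},
    {"id": "3", "cat": ["electronics"]},
]

# Facet counts: indexed term -> number of matching documents
facets = Counter(term for doc in docs for term in doc["cat"])
print(facets.most_common())  # e.g. [('electronics', 3), ('monitor', 1), ('memory', 1)]
```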
Highlighting
Highlighting in Solr allows fragments of documents that match the user's query to be included
with the query response. The fragments are included in a special section of the response
(the highlighting section), and the client uses the formatting clues also included to determine
how to present the snippets to users.
Spell Checking
The Spellcheck component is designed to provide inline query suggestions based on other,
similar, terms.
Relevance
Relevance is the degree to which a query response satisfies a user who is searching for
information.
The relevance of a query response depends on the context in which the query was performed.
A single search application may be used in different contexts by users with different needs and
expectations. For example, a search engine of climate data might be used by a university
researcher studying long-term climate trends, a farmer interested in calculating the likely date
of the last frost of spring, a civil engineer interested in rainfall patterns and the frequency of
floods, and a college student planning a vacation to a region and wondering what to pack.
Because the motivations of these users vary, the relevance of any particular response to a
query will vary as well.
Shutdown
To shut down Solr, from the terminal where you launched Solr, hit Ctrl+C. This will shut down
Solr cleanly.
Link: http://lucene.apache.org/solr/3_6_2/doc-files/tutorial.html
http://www.solrtutorial.com/
https://cwiki.apache.org/confluence/display/solr/
Screen Shots
Figure 2 Solr Admin UI-Dashboard Screen
Figure 3 Solr Admin UI-Collection Detail Screen
Figure 4 Solr Admin UI-Query Result Screen
Figure 5 Solr Admin UI-Fetching Data from Database Using DataImportHandler
Figure 6 Solr Admin UI-Schema.xml Screen
Figure 7 Solr Admin UI-SolrConfig.xml Screen
Figure 8 Solr Admin UI-Core Admin Detail Screen
Figure 9 Solr Admin UI-Java Properties Screen
Apache SolrCloud
SolrCloud is the name of a set of new distributed capabilities in Solr. Passing parameters to
enable these capabilities will enable you to set up a highly available, fault tolerant cluster of
Solr servers. Use SolrCloud when you want high scale, fault tolerant, distributed indexing and
search capabilities.
Solr embeds and uses Zookeeper as a repository for cluster configuration and coordination -
think of it as a distributed filesystem that contains information about all of the Solr servers.
Note: reset all configurations and remove the documents indexed in the earlier tutorial before
going through the cloud features.
Features
Centralized Apache ZooKeeper based configuration
Automated distributed indexing/sharding - send documents to any node and it will be
forwarded to correct shard
Near Real-Time indexing
Transaction log ensures no updates are lost even if the documents are not yet indexed to
disk
Automated query failover, index leader election and recovery in case of failure
No single point of failure
Simple two shard cluster
Figure 10 Simple Two Shard Cluster Image
This example simply creates a cluster consisting of two solr servers representing two different shards of a collection.
Since we'll need two solr servers for this example, simply make a copy of the example directory for the second server -- making sure you don't have any data already indexed.
rm -r example/solr/collection1/data/*
cp -r example example2
This command starts up a Solr server and bootstraps a new solr cluster.
cd example
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
-DzkRun causes an embedded zookeeper server to be run as part of this Solr server.
-Dbootstrap_confdir=./solr/collection1/conf causes the local configuration directory ./solr/collection1/conf to be uploaded as the "myconf" config. The name "myconf" is taken from the "collection.configName" param below.
-Dcollection.configName=myconf sets the config to use for the new collection.
-DnumShards=2 the number of logical partitions we plan on splitting the index into.
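With numShards=2, each document is routed to one of the two logical partitions by hashing its unique key. The sketch below is a generic stand-in for that idea, using CRC32 modulo the shard count; Solr's actual compositeId router hashes the id with MurmurHash3 into per-shard hash ranges, but the effect is the same: a stable id-to-shard mapping.

```python
import zlib

def shard_for(doc_id, num_shards=2):
    """Toy routing: hash the unique key and take it modulo the shard count.
    A given id always lands on the same shard, so updates to the same
    document always reach the same partition."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# Ids taken from the sample documents in exampledocs
for doc_id in ["3007WFP", "MA147LL/A", "TWINX2048-3200PRO"]:
    print(doc_id, "-> shard", shard_for(doc_id) + 1)
```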
Browse to http://localhost:8983/solr/#/~cloud to see the state of the cluster (the zookeeper distributed filesystem).
You can see from the zookeeper browser that the Solr configuration files were uploaded under "myconf", and that a new document collection called "collection1" was created. Under collection1 is a list of shards, the pieces that make up the complete collection.
Now we want to start up our second server - it will automatically be assigned to shard2 because we don't explicitly set the shard id.
Then start the second server, pointing it at the cluster:
cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
-Djetty.port=7574 is just one way to tell the Jetty servlet container to use a different port.
-DzkHost=localhost:9983 points to the Zookeeper ensemble containing the cluster state. In this example we're running a single Zookeeper server embedded in the first Solr server. By default, an embedded Zookeeper server runs at the Solr port plus 1000, so 9983.
If you refresh the zookeeper browser, you should now see both shard1 and shard2 in collection1. View http://localhost:8983/solr/#/~cloud.
Next, index some documents.
cd exampledocs
java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar ipod_video.xml
java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar monitor.xml
java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar mem.xml
And now, a request to either server results in a distributed search that covers the entire collection:
http://localhost:8983/solr/collection1/select?q=*:*
If at any point you wish to start over fresh or experiment with different configurations, you can delete all of the cloud state contained within zookeeper by simply deleting the solr/zoo_data directory after shutting down the servers.
Dealing with high volume of data
Solution: If the data volume grows, create more shards, or split existing shards across machines
with additional physical memory and storage, within the existing SolrCloud cluster.
Figure 11 Creating Shard and Replica when volume goes high
Link: http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-from-500000-volumes-5-million-volumes-and-beyond
Dealing with failure
Solution:
a. Failure of Zookeeper: run Zookeeper as an ensemble on separate servers, so that if one goes down the others can keep working; Zookeeper maintains all the cluster state and configuration information. (Use an odd number of servers, typically three, since Zookeeper needs a majority of the ensemble to stay up.)
b. Failure of a Solr shard: we can create a replica of each shard, so that if a shard goes down its replica can take over.
Figure 12 Diagram of the failure-handling scenario
Link:
https://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble
Synchronization of data (added/updated in DB) with Solr
Solution:
a. We can create a cron job which fetches data from the database and updates the index in Solr.
b. Alternatively, whenever data is added or updated from the frontend, right after inserting/updating the database in the business layer, we can add a piece of code that pushes the change through Solr's update APIs (since we integrate with .Net, we can use the SolrNet library, which provides such add/update APIs).
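Option (a) can be sketched language-agnostically: read the changed rows, turn them into Solr's JSON update format (a plain array of documents), and POST that to the /update/json handler. The sketch below only builds the payload; the rows and field names are hypothetical, and the HTTP POST is left out so the example stays self-contained:

```python
import json

# Hypothetical rows fetched from the database since the last sync
rows = [
    (1, "Dell Widescreen UltraSharp 3007WFP", 2199.0),
    (2, "Apple 60 GB iPod with Video Playback", 399.0),
]

# Solr's JSON update handler accepts a plain array of documents
docs = [{"id": str(pk), "name": name, "price": price} for pk, name, price in rows]
payload = json.dumps(docs)
print(payload)
# A cron job would POST this to http://localhost:8983/solr/update/json
# with Content-Type: application/json, then issue a commit.
```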
Link: http://wiki.apache.org/solr/DataImportHandler#Scheduling
http://stackoverflow.com/questions/6463844/how-to-index-data-in-solr-from-database-automatically
Limitations
1. No more than 50 to 100 million documents per node.
2. No more than 250 fields per document.
3. No more than 250K characters per document.
4. No more than 25 faceted fields.
5. No more than 32 nodes in your SolrCloud cluster.
6. Don't return more than 250 results on a query.
A major driving factor for Solr performance is RAM. Solr requires sufficient memory for two separate things: One is the Java heap, the other is "free" memory for the OS disk cache.
It is strongly recommended that Solr runs on a 64-bit Java. A 64-bit Java requires a 64-bit operating system, and a 64-bit operating system requires a 64-bit CPU. There's nothing wrong with 32-bit software or hardware, but a 32-bit Java is limited to a 2GB heap, which can result in artificial limitations that don't exist with a larger heap.
Link: http://lucene.472066.n3.nabble.com/Solr-limitations-td4076250.html
https://wiki.apache.org/solr/SolrPerformanceProblems
Screen Shots
Figure 13 Solr Admin UI-Cloud Screen
Figure 14 Solr Admin UI-Zookeeper maintains Cluster State Information that is shown in Tree Screen
Figure 15 Solr Admin UI-Cloud Graph Screen
Figure 16 Solr Admin UI-Cluster Information Screen
Integration with .Net using SolrNet
Solr exposes REST APIs which can be used to interact with it; however, the documents returned
as search results need to be deserialized into actual object containers. SolrNet is a .Net library
for interacting with Solr. It provides convenient, easy APIs to search, add and update data in Solr.
Further information on SolrNet is available at https://github.com/mausch/SolrNet
Figure 17 Integration with .Net