Intro to Apache Lucene and Solr
DESCRIPTION
Intro talk for UNC School of Information and Library Science. Covers basics of Lucene and Solr as well as info on Lucene/Solr jobs, opportunities, etc.
TRANSCRIPT
![Page 1: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/1.jpg)
Introduction to Open Source Search with Apache Lucene and Solr
Grant Ingersoll
![Page 2: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/2.jpg)
Lucid Imagination, Inc.
The How Many Game
•How many of you:
o Have taken a class in Information Retrieval (IR)?
o Are doing work/research in IR?
o Have heard of or are using Lucene?
o Have heard of or are using Solr?
o Are doing work on core IR algorithms such as compression techniques or scoring?
o Are doing UI/Application work/research as they relate to search?
![Page 3: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/3.jpg)
Topics
•Brief Bio
•Search 101 (skip?)
•What is:
o Apache Lucene
o Apache Solr
•What can they do?
o Features and functionality
o Intangibles
•What’s new in Lucene and Solr?
o How can they help my research/work/____?
![Page 4: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/4.jpg)
Brief Bio
•Apache Lucene/Solr Committer
•Apache Mahout co-founder
o Scalable Machine Learning
•Co-founder of Lucid Imagination
o http://www.lucidimagination.com
•Previously worked at the Center for Natural Language Processing at Syracuse University with Dr. Liddy
•Co-author of the upcoming “Taming Text” (Manning Publications)
o http://www.manning.com/ingersoll
![Page 5: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/5.jpg)
Search 101
•Search tools are designed for dealing with fuzzy data/questions
o Work well with structured and unstructured data
o Perform well when dealing with large volumes of data
o Many apps don’t need the limits that databases place on content
o Search fits well alongside a DB too
•Given a user’s information need (query), find and, optionally, score content relevant to that need
o Many different ways to solve this problem, each with tradeoffs
•What does “relevant” mean?
![Page 6: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/6.jpg)
Search 101
•Relevance
o Vector Space Model (VSM) for relevance
o Common across many search engines
o Apache Lucene is a highly optimized implementation of the VSM
•Indexing
o Finds and maps terms and documents
o Conceptually similar to a book index
o At the heart of fast search/retrieve
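The two ideas on this slide, an index that maps terms to documents and a vector-space notion of relevance, can be sketched in a few lines of Python. This is a toy illustration under simple assumptions (whitespace tokenization, a plain TF-IDF weighting, cosine similarity); it is not Lucene's actual scoring formula or index format, and the function names are made up for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Naive analyzer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}, like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by cosine similarity between TF-IDF vectors."""

    def idf(term):
        # Rare terms get a higher weight than common ones.
        return math.log(n_docs / len(index[term])) + 1.0

    # Squared length of every document vector, used for normalization.
    doc_norm_sq = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            doc_norm_sq[doc_id] += (tf * idf(term)) ** 2

    # Query vector; terms that never occur in the index are ignored.
    query_weights = {
        term: tf * idf(term)
        for term, tf in Counter(tokenize(query)).items()
        if term in index
    }
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))

    # Only documents that share a term with the query are ever touched.
    dots = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, tf in index[term].items():
            dots[doc_id] += q_weight * tf * idf(term)

    ranked = [
        (doc_id, dot / (query_norm * math.sqrt(doc_norm_sq[doc_id])))
        for doc_id, dot in dots.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

The point of the index is visible in `search`: the query only visits the postings of its own terms instead of scanning every document.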
![Page 7: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/7.jpg)
Apache Lucene in a Nutshell
•http://lucene.apache.org/java
•Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
•Fast and efficient scoring and indexing algorithms
•Lots of contributions to make common tasks easier:
o Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
•Most widely deployed search library on the planet
![Page 8: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/8.jpg)
Lucene Basics
•Content is modeled via Documents and Fields
o Content can be text, integers, floats, dates, custom
o Analysis can be employed to alter content before indexing
•Searches are supported through a wide range of Query options
o Keyword
o Terms
o Phrases
o Wildcards
o Many, many more
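To make the Document/Field model and the query types above concrete, here is a small Python sketch. It is a conceptual toy, not Lucene's API: `ToyIndex`, `analyze`, and the stop list are hypothetical names invented for this example, and positions are simply counted after stopword removal.

```python
import fnmatch
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "and"}  # hypothetical stop list


def analyze(text):
    """Toy analysis: extract alphanumeric tokens, lowercase, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


class ToyIndex:
    """Positional index keyed by (field, term); an illustration only."""

    def __init__(self):
        # (field, term) -> {doc_id: [positions]}
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, fields):
        """A document is a set of named fields; each field is analyzed."""
        for field, text in fields.items():
            for pos, term in enumerate(analyze(text)):
                self.postings[(field, term)][doc_id].append(pos)

    def term(self, field, term):
        """Term query: documents whose field contains the exact term."""
        return set(self.postings.get((field, term), {}))

    def phrase(self, field, text):
        """Phrase query: the analyzed terms must appear consecutively."""
        terms = analyze(text)
        if not terms:
            return set()
        hits = set()
        first = self.postings.get((field, terms[0]), {})
        for doc_id, starts in first.items():
            for start in starts:
                if all(
                    start + i in self.postings.get((field, t), {}).get(doc_id, [])
                    for i, t in enumerate(terms)
                ):
                    hits.add(doc_id)
        return hits

    def wildcard(self, field, pattern):
        """Wildcard query: match indexed terms against a shell-style pattern."""
        hits = set()
        for (f, term), docs in self.postings.items():
            if f == field and fnmatch.fnmatchcase(term, pattern):
                hits.update(docs)
        return hits
```

For example, after adding a document with `{"title": "Lucene in Action"}`, `term("title", "lucene")` finds it because analysis lowercased the title before indexing.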
![Page 9: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/9.jpg)
Apache Solr in a Nutshell
•http://lucene.apache.org/solr
•Lucene-based Search Server + other features and functionality
•Access Lucene over HTTP:
o Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
•Most programming tasks in Lucene are configuration tasks in Solr
•Faceting (guided navigation, filters, etc.)
•Replication and distributed search support
•Lucene Best Practices
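Because Solr is reached over HTTP, a client in any language only needs to build a URL. The sketch below assembles a query against Solr's `/select` handler using the standard `q`, `rows`, `wt`, `facet`, and `facet.field` parameters. The helper name `solr_select_url` is made up for this example, and the base URL and the `cat` field are assumptions borrowed from a default local example install.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr select URL; base_url and field names are assumptions."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_fields:
        # Faceting is switched on once, then one facet.field per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local server; fetching it requires a running Solr.
    print(solr_select_url("http://localhost:8983/solr", "ipod", ["cat"]))
```

With a server running, the resulting URL could be fetched with `urllib.request.urlopen` and the body parsed with `json.loads`.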
![Page 10: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/10.jpg)
A small sampling of Lucene/Solr-Powered Sites
Buy.com
![Page 11: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/11.jpg)
Features and Functionality
![Page 12: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/12.jpg)
Quick Solr/Lucene Demo
•Pre-reqs:
o Apache Ant 1.7.x, Subversion (SVN)
•Command Line 1:
o svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
o cd solr-trunk/solr/
o ant example
o cd example
o java -Dsolr.clustering.enabled=true -jar start.jar
•Command Line 2:
o cd exampledocs; java -jar post.jar *.xml
•http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
![Page 13: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/13.jpg)
Other Features
•Data Import Handler
o Database, Mail, RSS, etc.
•Rich document support via Apache Tika
o PDF, MS Office, Images, etc.
•Replication for high query volume
•Distributed search for large indexes
o Production systems with 1B+ documents
•Configurable Analysis chain and other extension points
o Total control over tokenization, stemming, etc.
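The idea of a configurable analysis chain, one tokenizer followed by any number of token filters, can be illustrated with plain function composition. This is only a sketch of the concept in Python: the tokenizers, filters, and the deliberately crude plural stemmer are hypothetical stand-ins, not Solr's analyzer classes or its schema configuration.

```python
import re


def whitespace_tokenizer(text):
    """Split on whitespace and keep punctuation attached."""
    return text.split()


def letter_tokenizer(text):
    """Keep only runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(stopwords):
    """Return a filter that drops the given stopwords."""
    def apply(tokens):
        return [t for t in tokens if t not in stopwords]
    return apply


def plural_stem_filter(tokens):
    """Very crude stemmer (assumption): strip one trailing 's' from longer tokens."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def make_analyzer(tokenizer, *filters):
    """Compose a tokenizer and an ordered list of filters into one analyzer."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze
```

Swapping the tokenizer or reordering the filters changes what ends up in the index, which is the kind of control the slide refers to.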
![Page 14: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/14.jpg)
Intangibles
•Open Source
•Flexible, non-restrictive license
o Apache License v2 (non-viral)
o “Do what you want with the software, just don’t claim you wrote it”
•Large community willing to help
o Great place to learn about real world IR systems
•Many books and other documentation
o Lucene in Action by Hatcher, McCandless and Gospodnetic
![Page 15: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/15.jpg)
What’s New?
•https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt
•https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt
•Codecs
o Pluggable Index Formats
o Provide different index compression techniques
•Stats to enable alternate scoring approaches (BM25, Language Modeling, etc.); more work to be done here
•Faster
o Java Strings are slow; convert to use byte arrays
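As an example of the alternate scoring approaches mentioned above, here is a minimal sketch of the Okapi BM25 formula in Python. It shows which index statistics such a model needs (document frequency, document count, average document length). The parameter defaults `k1=1.2` and `b=0.75` are common textbook choices, and the function is an illustration rather than Lucene's implementation.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document for a query with Okapi BM25.

    doc_freqs maps a term to the number of documents containing it.
    """
    score = 0.0
    doc_len = len(doc_terms)
    for term in set(query_terms):
        tf = doc_terms.count(term)
        df = doc_freqs.get(term, 0)
        if tf == 0 or df == 0:
            continue
        # Non-negative IDF variant: rare terms contribute more.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates, and long documents are penalized via b.
        length_norm = 1 - b + b * doc_len / avg_doc_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score
```

Unlike raw TF-IDF, repeating a term gives diminishing returns here, and a match in a short document counts for more than the same match in a long one.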
![Page 16: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/16.jpg)
Other New Items
•Many new Analyzers (tokenizers, etc.)
o Richer Language support (Hindi, Indonesian, Arabic, …)
•Richer Geospatial (Local) Search capabilities
o Score, filter, sort by distance
o http://wiki.apache.org/solr/SpatialSearch
•Results Grouping
o Group Related Results
o http://wiki.apache.org/solr/FieldCollapsing
•More Faceting Capabilities
o Pivot
o New underlying algorithms
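The "filter and sort by distance" idea from the geospatial bullet can be sketched with the haversine great-circle formula. This Python example is only a conceptual illustration: the `lat`/`lon`/`id` keys and the `nearby` helper are assumptions made for this sketch, and Solr's spatial search exposes this through its own field types and query parameters instead.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against floating-point values slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearby(docs, lat, lon, max_km):
    """Filter docs to those within max_km, sorted nearest first."""
    hits = [(haversine_km(lat, lon, d["lat"], d["lon"]), d["id"]) for d in docs]
    return [(doc_id, dist) for dist, doc_id in sorted(hits) if dist <= max_km]
```

The same distance value could also be folded into a relevance score, for instance by boosting closer results.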
![Page 17: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/17.jpg)
How can Lucene/Solr help me?
Everyone
•Fast indexing/search times mean less time waiting for jobs to complete
•Completely Open (source, community)
•Free to use, modify, etc.
•Large community ready and willing to help
User Experience Researchers
•Rapid UI prototyping
•Total control of results and facets
•Easy to set up and use with little to no programming required
IR Researchers
•Flexible Indexing models (trunk)
•Flexible Relevance models via functions and other mechanisms
•Extendable
Job Seekers
•Google Summer of Code
•Other Internships (see me)
•Real programming skills that are highly valued in industry
•Publicly visible, demonstrable skills
Lucene/Solr
![Page 18: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/18.jpg)
Job Trends
http://www.indeed.com
![Page 19: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/19.jpg)
Other Things that Can Help
•Nutch
o Crawling
o http://nutch.apache.org
•Mahout
o Machine learning (clustering, classification, others)
o http://mahout.apache.org
•OpenNLP
o Part of Speech, Parsers, Named Entity Recognition
o http://incubator.apache.org/opennlp
•Open Relevance Project
o Relevance Judgments
o http://lucene.apache.org/openrelevance
![Page 20: Intro to Apache Lucene and Solr](https://reader036.vdocuments.mx/reader036/viewer/2022062303/554a5771b4c905522f8b4cf3/html5/thumbnails/20.jpg)
Resources
•http://lucene.apache.org
•http://www.lucidimagination.com
•{java-user|solr-user}@lucene.apache.org
•@gsingers
•http://www.slideshare.net/gsingers