search me: using lucene.net
DESCRIPTION
May 2012 JaxDUG presentation by Zachary Gramana on using the Lucene.NET library to add search functionality to .NET applications. Contains an overview of search/information retrieval concepts and highlights some common use cases.
TRANSCRIPT
SEARCH ME
Using Lucene.Net In Your Apps
About Me
Zachary Johnson Gramana
Engineer at Potts Consulting Group
Proud new father of Rex
Search is...
A vague term that encompasses multiple problems.
A better term is “information retrieval” (IR), or “IR system”.
Interdisciplinary, drawing from:
computer science (parsing, data structures)
psychology (query grammar, human/computer interaction)
linguistics (textual analysis)
information science (scoring/relevancy)
maths (document retrieval strategy)
Problems Solved
Information Overload
Find the information that users want,
not just the information they asked for.
Transparently handle all kinds of data:
structured (hierarchical)
semi-structured (markup)
un-structured data (plain text)
Single portal to multiple data types and sources.
Do it fast!
Basic IR System Capabilities
Collection (importing, crawling):
Anonymous web page crawling (Google)
User-uploaded photographs (Flickr)
Publisher upload of .mp3 files (iTunes)
Indexing:
Analysis
Modify index data structure
Querying:
Input parsing
Query generation & execution
Collecting the results
Filtering the results (optional)
What is Lucene.Net?
Port of the Apache Software Foundation’s Lucene libraries from Java to C#.
It’s a search library.
Lucene was created by Doug Cutting.
Named after his wife’s middle name.
First released in 2000 on SourceForge.
Migrated to the Apache Software Foundation in 9/2001.
Used By
StackOverflow
JIRA
IBM
Akamai
Apple
Autodesk
Orchard
RavenDB
CouchDB
What Isn’t Lucene.NET
Not a complete information retrieval system. Check out Google Search Appliance instead:
http://www.google.com/enterprise/search/
Not a web crawler. Check out Arachnode instead:
http://arachnode.net
Not a query service. Check out Solr instead:
http://lucene.apache.org/solr
Not hard. Check out Windows Search SDK instead:
http://bit.ly/ImRtMk
Concept and Overview
What’s In an Index?
Stores a collection of Documents, each of which represents a source record.
Documents contain:
Metadata about the source record.
(optionally) actual data from the source record.
(optionally) derived analytical products.
Documents store a collection of token/frequency pairs (optionally with positions), plus a document identifier.
Lucene’s Index Structure
Documents store a collection of fields.
Fields are a collection of terms, plus an identifier and optional term vectors.
Terms are string key-value pairs of a field name and a string value.
Lucene provides special classes to deal with tricky data, like the NumericField class.
Term vectors are terms, along with their frequency counts and positions.
Fields can be indexed, stored, or both. Storing allows a term value to be retrieved after indexing.
Indexing adds the term value to Lucene’s inverted index.
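The difference between an indexed field and a stored field can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Lucene.Net's API: the names Field, Document, ToyIndex, add and search are hypothetical, and the analysis step is just lowercasing and splitting on whitespace.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    value: str
    indexed: bool = True   # searchable: tokens are added to the inverted index
    stored: bool = False   # retrievable: the original value is kept with the document


@dataclass
class Document:
    fields: list = field(default_factory=list)


class ToyIndex:
    """Hypothetical stand-in for an index; not Lucene's API."""

    def __init__(self):
        self.stored = {}     # doc id -> {field name: original value}
        self.inverted = {}   # (field name, token) -> set of doc ids

    def add(self, doc):
        doc_id = len(self.stored)
        # Only fields flagged as stored keep their original value.
        self.stored[doc_id] = {f.name: f.value for f in doc.fields if f.stored}
        # Only fields flagged as indexed contribute terms to the inverted index.
        for f in doc.fields:
            if f.indexed:
                for token in f.value.lower().split():
                    self.inverted.setdefault((f.name, token), set()).add(doc_id)
        return doc_id

    def search(self, name, token):
        return sorted(self.inverted.get((name, token.lower()), set()))


index = ToyIndex()
doc_id = index.add(Document([
    Field("title", "Search Me", indexed=True, stored=True),
    Field("body", "Using Lucene.Net in your apps", indexed=True, stored=False),
    Field("url", "http://www.excitabyte.com/", indexed=False, stored=True),
]))

print(index.search("body", "apps"))       # body is searchable...
print(index.stored[doc_id].get("body"))   # ...but its text was not stored
print(index.stored[doc_id]["url"])        # url is retrievable but not searchable
```

In this sketch the body field can be found by a search but cannot be displayed afterwards, while the url field can be displayed but never matches a query. Choosing both flags per field is the trade-off between index size and what a results page can show.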
The Inverted Index
(taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH)
Lucene’s Index Structure
What is an ‘inverted index’?
forward index: document points to a collection of terms
inverted index: term points to a collection of documents
The index consists of one or more segments.
Each segment is a self-contained, independent partition of the entire index.
Each segment stores: field names, field values, term dictionary, term frequencies, term proximities, normalization factors, term vectors, and an (optional) deleted-record lookup table.
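The forward/inverted distinction is easy to see in code. The following is a minimal sketch, assuming a tiny hand-made corpus of already-tokenized documents; the function name invert and the dictionary layout are illustrative choices, not Lucene's on-disk segment format.

```python
from collections import defaultdict

# Forward index: each document id points to its terms, in order.
forward_index = {
    1: ["lucene", "is", "a", "search", "library"],
    2: ["search", "is", "a", "vague", "term"],
    3: ["lucene", "net", "is", "a", "port"],
}


def invert(forward):
    """Build term -> {doc id: [positions]} from doc id -> [terms]."""
    inverted = defaultdict(dict)
    for doc_id, terms in forward.items():
        for position, term in enumerate(terms):
            inverted[term].setdefault(doc_id, []).append(position)
    return dict(inverted)


inverted_index = invert(forward_index)

# Looking up a term is now one dictionary access instead of a scan
# over every document.
postings = inverted_index["search"]
print(postings)  # doc id -> positions of "search" in that document

# Term frequency per document falls out of the position lists.
frequencies = {doc_id: len(positions) for doc_id, positions in postings.items()}
print(frequencies)
```

Keeping positions alongside document ids is what makes phrase and proximity matching possible later; an index that only needs "which documents contain this term" could store plain sets instead.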
Analysis
(taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH)
Tokenization
(taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH)
Tokenization
Normalization: “Gramåna” > “gramana”
Stemming: “preschooling” > “school”
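The three analysis steps named above can be chained into a small pipeline. This is a rough sketch under loud assumptions: the affix lists are invented for illustration, and the prefix rule exists only to mirror the slide's "preschooling" example. Real stemmers such as Snowball use much richer, mostly suffix-based rules, so treat this as a demonstration of the pipeline shape rather than of a real stemming algorithm.

```python
import re
import unicodedata

# Toy affix lists (assumptions for illustration only).
PREFIXES = ("pre",)
SUFFIXES = ("ing", "ed", "s")
MIN_STEM = 3  # never shrink a token below this many characters


def tokenize(text):
    """Split text into runs of word characters."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Lowercase and strip diacritics, e.g. 'Gramåna' -> 'gramana'."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(token):
    """Strip at most one known prefix and one known suffix."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= MIN_STEM:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM:
            token = token[:-len(suffix)]
            break
    return token


def analyze(text):
    """Tokenize, then normalize and stem each token."""
    return [stem(normalize(token)) for token in tokenize(text)]


print(analyze("Gram\u00e5na is preschooling"))
```

The same analyze function has to be applied to both the indexed text and the query text; otherwise a query for "Gramåna" would never match the indexed term "gramana".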
Norms
(taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH)
Time to Look at Some Code
Getting a Query
Two options:
Parse a search string using a QueryParser class.
Programmatically build a query.
QueryParser can build very complex queries very quickly, but requires the user to provide a query string.
Programmatic building of a query requires less overhead for simple queries.
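The two options can be compared side by side in a toy model. The classes below borrow the names TermQuery and BooleanQuery for readability, but they are hypothetical stand-ins written for this sketch, and the parse function handles only a tiny invented grammar ("field:word" terms joined by a single kind of operator). It is not how Lucene's QueryParser is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    field: str
    text: str


@dataclass(frozen=True)
class BooleanQuery:
    operator: str   # "AND" or "OR"
    clauses: tuple


def parse(query_string, default_field="body"):
    """Parse 'field:word AND word ...' into a query tree.

    Toy grammar: terms joined by one kind of operator (AND or OR);
    no grouping, phrases, or escaping.
    """
    words = query_string.split()
    operators = {w for w in words if w in ("AND", "OR")}
    if len(operators) > 1:
        raise ValueError("mixing AND and OR is not supported in this sketch")
    terms = []
    for word in words:
        if word in ("AND", "OR"):
            continue
        field, sep, text = word.partition(":")
        if not sep:  # no explicit field given, fall back to the default
            field, text = default_field, word
        terms.append(TermQuery(field, text.lower()))
    if len(terms) == 1:
        return terms[0]
    # Bare words with no operator are treated as OR in this sketch.
    return BooleanQuery(operators.pop() if operators else "OR", tuple(terms))


# Option 1: parse a user-supplied search string.
parsed = parse("title:lucene AND search")

# Option 2: build the same query programmatically.
built = BooleanQuery("AND", (TermQuery("title", "lucene"),
                             TermQuery("body", "search")))

print(parsed == built)
```

Both routes end in the same query tree. Parsing pays for tokenizing and validating a string, which is worthwhile when the string comes from a user; when the application already knows the field and value, constructing the objects directly skips that step.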
General Query Types
(taken from the Wikipedia entry “Information Retrieval” at http://bitly.com/T1Qbw)
Some Lucene Query Types
TermQuery (general purpose)
BooleanQuery
MultiPhraseQuery
SpanQuery
WildcardQuery
FilteredQuery
MoreLikeThisQuery
BoostingQuery
FuzzyQuery
ConstantScoreRangeQuery
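To give a feel for how a few of these query types differ, here is a small sketch that evaluates term, boolean, wildcard, and fuzzy lookups against a hand-built term dictionary. The function names and the INDEX contents are made up for illustration; the fuzzy match uses plain Levenshtein edit distance as an assumption about what "fuzzy" means, and none of this reflects Lucene's actual scoring or implementation.

```python
from fnmatch import fnmatchcase

# term -> set of doc ids; a hand-built toy index for illustration.
INDEX = {
    "lucene": {1, 3},
    "lucent": {4},
    "search": {1, 2},
    "searching": {2},
    "port": {3},
}


def term_query(term):
    """Exact lookup of a single term."""
    return set(INDEX.get(term, set()))


def boolean_and(*results):
    """Documents present in every clause's result set."""
    return set.intersection(*results) if results else set()


def boolean_or(*results):
    """Documents present in at least one clause's result set."""
    return set.union(set(), *results)


def wildcard_query(pattern):
    """Expand a pattern such as 'search*' against the term dictionary."""
    docs = set()
    for term, postings in INDEX.items():
        if fnmatchcase(term, pattern):
            docs |= postings
    return docs


def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def fuzzy_query(term, max_edits=1):
    """Match every indexed term within max_edits of the query term."""
    docs = set()
    for candidate, postings in INDEX.items():
        if edit_distance(term, candidate) <= max_edits:
            docs |= postings
    return docs


print(boolean_and(term_query("lucene"), term_query("search")))
print(wildcard_query("search*"))
print(fuzzy_query("lucenne"))
```

Note that the wildcard and fuzzy lookups both scan the whole term dictionary here, which hints at why such queries tend to cost more than an exact term lookup.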
Time to Look at More Code
Lucene.Net Contribs
Spatial (geo-spatial search)
Similarity
SimpleFacetedSearch
Highlighter
SpellChecker
WordNet (synonyms)
Snowball (stemming library)
RegEx
Thanks for your time and attention.
twitter: @zgramana
blog: http://www.excitabyte.com/
Email: zgramanaATgee mail dot com
That’s All!