Why Search? (Starring Elasticsearch)
DESCRIPTION
Why do we need a dedicated search engine to search our unstructured text data? Why can't we just rely on the features built into most databases?
TRANSCRIPT
Why Search? (Starring Elasticsearch)
Doug Turnbull
OpenSource Connections
Hello
• Me: @[email protected]
• Us: http://o19s.com
  o World-class search consultants
  o Right here in C’ville!
  o Hiring passionate interns!
Why Search?
• What does a dedicated search engine do that a database doesn’t?
• Why not [MySQL | MongoDB | Cassandra | etc.]?
• Why a dedicated search engine?
Why not MySQL?
• We’ve got rows of stuff in tables. E.g., for SciFi StackExchange, we’ve stored ~20K posts:
PostID | UserId | CreationDate            | ViewCount | Body
0      | 1      | 2011-01-11T20:52:46.753 | 124       | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>
1      | 2      | 2013-02-01T12:44:46.525 | 525       | <p>Been meaning to read the Foundation Series, what should I read first?</p>
Why not MySQL?
• Our mission: find every post in SciFi StackExchange that mentions “Darth Vader”!
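PostID | UserId | CreationDate            | ViewCount | Body | “Darth Vader”?
0      | 1      | 2011-01-11T20:52:46.753 | 124       | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p> | Found!
1      | 2      | 2013-02-01T12:44:46.525 | 525       | <p>Been meaning to read the Foundation Series, what should I read first?</p> | Missing!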
Why not MySQL – SQL Like?
• SQL “LIKE” operator – scan all rows for a specific wildcard match
SELECT * FROM posts WHERE body LIKE "%darth vader%"
• Performs a table scan: every row is checked in turn (Match? Match? Match? Match?)
• Approx. 300 ms to search a measly 20K docs! (What if we had 20 million?)
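To make the table scan concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; the slides use MySQL, and query planners differ between engines). The table layout, the index name idx_body, and the two rows are illustrative. It asks the engine for its query plan for a LIKE pattern with a leading wildcard.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (post_id, body) VALUES (?, ?)",
    [
        (0, "<p>What exactly did Obiwan know about Anakin and Darth Vader "
            "before a New Hope started?</p>"),
        (1, "<p>Been meaning to read the Foundation Series, "
            "what should I read first?</p>"),
    ],
)
# A conventional index on body does not let the engine seek to a match
# when the pattern starts with a wildcard.
conn.execute("CREATE INDEX idx_body ON posts (body)")

query = "SELECT post_id FROM posts WHERE body LIKE '%darth vader%'"
matches = [row[0] for row in conn.execute(query)]
# The last column of each EXPLAIN QUERY PLAN row is the human-readable step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(matches)  # ids of posts containing the literal substring
print(plan)     # the plan steps report a SCAN rather than a SEARCH
```

The exact wording of the plan varies by SQLite version, but the step is reported as a scan: every row (or every index entry) has to be examined.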
SQL Like – other problems
• Can’t search for words out of order:
SELECT * FROM posts WHERE body LIKE "%vader, darth%"
0 results
• Can’t search for alternate forms of a word:
SELECT * FROM posts WHERE body LIKE "%kittie pictures%"
SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"
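Both failure modes can be reproduced in a few lines. This is a sketch using sqlite3 as a stand-in for MySQL (an assumption), with two made-up rows; the count_like helper is hypothetical and exists only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (body TEXT)")
# Two illustrative rows; the second one is invented for this example.
conn.execute("INSERT INTO posts VALUES ('<p>Darth Vader dined with Luke</p>')")
conn.execute("INSERT INTO posts VALUES ('<p>Funny kitty pictures</p>')")

def count_like(pattern):
    """Count rows whose body matches the given LIKE pattern."""
    row = conn.execute(
        "SELECT COUNT(*) FROM posts WHERE body LIKE ?", (pattern,)
    ).fetchone()
    return row[0]

print(count_like("%darth vader%"))     # 1: same words, same order as the text
print(count_like("%vader, darth%"))    # 0: same words, different order
print(count_like("%kittie pictures%")) # 0: alternate spelling of "kitty"
```

LIKE compares characters, not words, so any change in order or spelling is a miss.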
SQL Like – other problems
• No Ranking of Results – given these two docs:
One might ask how none of the Jedi at Qui-Gon's funeral noticed that there was a Dark Lord of the Sith standing right behind them. Darth Vader and Obi-Wan only noticed each other when on the same station … It's apparently hard to pick up another force-user without knowing he or she is there…
I seem to remember a novel, I think it was Dark Lord: The Rise of Darth Vader, that addressed this. It made the assertion that while Darth Vader had lost both hands, he was still as formidable, in the force sense,
- Directly about Darth Vader
- Darth Vader is a side topic here
Which should come first?
SQL LIKE | CTRL+F | grep is
1. Extremely Slow
2. Not fuzzy -- Needs exact literal matches, no fuzziness!
3. Unranked -- Simply says y/n whether there is a match
Search needs to be
1. FAST! A data structure that can efficiently take search terms and return a set of documents
2. FUZZY! A way to record positional and fuzzy modifications to text to assist matching
3. FRUITFUL! Relevant documents bubble to the top.
Let's play with an implementation
• Lucene -> Elasticsearch
Lucene
Solr
Elasticsearch
• Lucene, 1999, by Doug Cutting
  o Java library for search
• Solr, 2006, Yonik Seeley
  o First to put Lucene behind an HTTP interface
  o Still going strong
• Elasticsearch, 2010, Shay Banon
  o Alternative implementation
  o Extremely REST-y
• Your database’s full-text search features
  o MySQL, for example, has a FULLTEXT index
  o Works for trivial cases, not the path of wisdom
Elasticsearch
• Create an index
curl -XPUT http://localhost:9200/stackexchange
• Index some docs!
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
  "Body": "<p>Darth Vader dined with Luke</p>",
  "Title": "..."
}'
What is being built?
The answer can be found in your textbook…
Book index:
• Topics -> page no.
• Very efficient tool – compare to scanning the whole book!
Lucene uses an index:
• Tokens => document ids:
  laser => [2, 4]
  light => [2, 5]
  lightsaber => [0, 1, 5, 7]
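The token-to-document-ids mapping above can be sketched in a few lines of Python. This is a toy illustration of the idea, not how Lucene stores its index; the build_inverted_index function and the document texts are made up so that the resulting postings match the ones listed on the slide.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Hypothetical documents, chosen to reproduce the postings on the slide.
docs = {
    0: "lightsaber duel",
    1: "red lightsaber",
    2: "laser light show",
    4: "laser cannon",
    5: "light from a lightsaber",
    7: "lightsaber crystal",
}
index = build_inverted_index(docs)

print(index["laser"])       # [2, 4]
print(index["light"])       # [2, 5]
print(index["lightsaber"])  # [0, 1, 5, 7]
```

Looking up a term is now a single dictionary access instead of a pass over every document, which is the same trade a book index makes.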
Computers == Dumb
• Humans are smart
  o I see “cat” or “cats” in the back of a book, no duh – jump to page 9
• Computers are dumb
  o “CAT” != “cat” – no match returned
  o “cat” != “cats” – no match returned
• Hence, when indexing, normalize text to a more searchable form:
  cats -> cat
  fitted -> fit
  alumnus -> alumnu
Normalization aka Text Analysis
• Raw input filtered (char filter)
  o <p>Darth Vader dined with Luke</p>
  o Darth Vader dined with Luke
• Tokenized
  o Darth Vader dined with Luke
  o [Darth] [Vader] [dined] [with] [Luke]
• Token filters (lowercased, synonyms applied, pointless words removed)
  o [darth] [vader] [dine] [luke]
• Most importantly: this is highly configurable
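The three stages above (char filter, tokenizer, token filters) can be sketched as plain functions. This is a simplified illustration, not Elasticsearch's analysis chain: the STOP_WORDS set and the STEMS lookup table are hypothetical stand-ins for a real stop-word filter and stemmer, and all function names are invented for this example.

```python
import re

# Hypothetical stop list and stem table; a real analyzer would use a
# configurable stop filter and an algorithmic stemmer instead.
STOP_WORDS = {"with", "the", "a", "and"}
STEMS = {"dined": "dine", "cats": "cat", "fitted": "fit"}

def char_filter(text):
    """Strip HTML tags from the raw input."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    """Split the text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def token_filters(tokens):
    """Lowercase, drop stop words, then map words to their stem."""
    lowered = [token.lower() for token in tokens]
    kept = [token for token in lowered if token not in STOP_WORDS]
    return [STEMS.get(token, token) for token in kept]

def analyze(text):
    """Run the full pipeline: char filter -> tokenizer -> token filters."""
    return token_filters(tokenize(char_filter(text)))

print(analyze("<p>Darth Vader dined with Luke</p>"))
# ['darth', 'vader', 'dine', 'luke']
```

Each stage is a separate function, so any one of them can be swapped out without touching the others, which is the sense in which analysis is configurable.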
Normalization aka Text Analysis
curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'
{
  "tokens": [
    { "end_offset": 5, "position": 1, "start_offset": 0, "token": "darth", "type": "<ALPHANUM>" },
    { "end_offset": 11, "position": 2, "start_offset": 6, "token": "vader", "type": "<ALPHANUM>" },
    { "end_offset": 17, "position": 3, "start_offset": 12, "token": "dine", "type": "<ALPHANUM>" },
    { "end_offset": 27, "position": 5, "start_offset": 23, "token": "luke", "type": "<ALPHANUM>" }
  ]
}
What is being built?
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
  "Body": "<p>Darth Vader dined with Luke</p>",
  "Title": "..."
}'
curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
  "Body": "<p>We love Darth</p>",
  "Title": "..."
}'
Ranking
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
Can we store anything here (in <metadata>) to help decide how relevant this term is for this doc?
Yes!
- Term frequency
  - How much “darth” is in this doc?
- Position within document
  - Helps when we search for the phrase “darth vader”
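As a rough sketch of what that per-document metadata could look like, the index below stores, for each term and document, the list of positions at which the term occurs. Term frequency is then the length of that list, and a two-word phrase match is a check for adjacent positions. The function names and the data layout are assumptions made for illustration; Lucene's actual postings format is more involved.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}; term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_match(index, first, second):
    """Return ids of docs in which `second` occurs right after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(position + 1 in following for position in positions):
            hits.append(doc_id)
    return sorted(hits)

# The two example documents, already run through analysis.
docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}
index = build_positional_index(docs)

print(len(index["darth"][1]))                 # term frequency of "darth" in doc 1
print(phrase_match(index, "darth", "vader"))  # docs containing the phrase
```

Both documents contain "darth", but only doc 1 has "vader" in the position immediately after it, so only doc 1 matches the phrase.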
Query Documents
• When did Darth Vader and Luke have dinner?
curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '{
  "query": {
    "match": { "Body": "luke darth dinner" }
  }
}'
User query: luke darth dinner
What happens when we query?
luke darth dinner
  -> Analysis -> [luke] [darth] [dine]
How to consult the index for matches? Look up each term: [darth], [dine], ...
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
Score for [darth] docs (1 and 2)
Score for [dine] docs (1)
Return sorted docs to client
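The flow above (analyze the query, look up each term, score the matching docs, return them sorted) can be sketched as follows. The scoring here is a simple TF-IDF variant chosen for illustration; it is an assumption and not Lucene's actual scoring formula. The query terms are taken as already analyzed, exactly as shown on the slide, and the function names are invented for this example.

```python
import math
from collections import Counter, defaultdict

def build_postings(docs):
    """Map term -> Counter({doc_id: term frequency})."""
    postings = defaultdict(Counter)
    for doc_id, tokens in docs.items():
        for term in tokens:
            postings[term][doc_id] += 1
    return postings

def search(query_terms, postings, doc_lengths):
    """Sum a TF-IDF style score per doc; return (doc_id, score), best first."""
    n_docs = len(doc_lengths)
    scores = Counter()
    for term in query_terms:
        docs_with_term = postings.get(term, {})
        if not docs_with_term:
            continue  # term not in the index: contributes nothing
        # Rarer terms get a higher weight than terms found in every doc.
        idf = math.log(1 + n_docs / len(docs_with_term))
        for doc_id, tf in docs_with_term.items():
            scores[doc_id] += (tf / doc_lengths[doc_id]) * idf
    return scores.most_common()

# The two example documents, already run through analysis.
docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}
postings = build_postings(docs)
doc_lengths = {doc_id: len(tokens) for doc_id, tokens in docs.items()}

results = search(["luke", "darth", "dine"], postings, doc_lengths)
print(results)  # doc 1 first (matches all three terms), then doc 2
```

Only the postings for the query terms are touched, so the cost depends on how many documents contain those terms rather than on the size of the whole collection.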
So Elasticsearch!
• FAST!
  o Inverted index data structure is blazing fast
  o Lucene is probably the most tuned implementation
• FUZZY!
  o We use analysis to normalize text to canonical forms
  o We can use positional information when querying (not shown here)
• FRUITFUL!
  o Relevant documents are scored based on relative term frequency
BUT WAIT THERE’S MORE
• Many non-traditional applications of “search”
  o Rank file directory by proximity to current directory
  o Geographic-aided search, rank based on distance and search relevancy
  o Q & A systems – Watson has a ton of Lucene
  o Log aggregation, e.g. Kibana -- because in Lucene everything is indexed!
• And many features!
  o Spellchecking
  o Facets
  o More-like-this document
QUESTIONS?