Why Search? (starring Elasticsearch)

Doug Turnbull, OpenSource Connections

Upload: doug-turnbull

Post on 31-Oct-2014


DESCRIPTION

Why do we need a dedicated search engine to search our unstructured text data? Why can't we just rely on the features built into most databases?

TRANSCRIPT

Page 1: Why Search? (starring Elasticsearch)

Why Search? (starring Elasticsearch)

Doug Turnbull

OpenSource Connections

Page 2: Why Search? (starring Elasticsearch)

Hello

• Me: @[email protected]

• Us: http://o19s.com
o World-class search consultants
o Right here in C’ville!
o Hiring passionate interns!

Page 3: Why Search? (starring Elasticsearch)

Why Search?

• What does a dedicated search engine do that a database doesn’t?

• Why not [MySQL | MongoDB | Cassandra | etc.]?

• Why a dedicated search engine?

Page 4: Why Search? (starring Elasticsearch)

Why not MySQL?

• We’ve got rows of stuff in tables. E.g., for SciFi StackExchange, we’ve stored ~20K posts:

PostID | UserId | CreationDate            | ViewCount | Body
0      | 1      | 2011-01-11T20:52:46.753 | 124       | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>
1      | 2      | 2013-02-01T12:44:46.525 | 525       | <p>Been meaning to read the Foundation Series, what should I read first?</p>

Page 5: Why Search? (starring Elasticsearch)

Why not MySQL?

• Our mission: find every mention of “Darth Vader” in SciFi StackExchange posts!

P | U | C | V | Body | Result
0 | 1 | 2 | 1 | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p> | Found!
1 | 2 | 2 | 5 | <p>Been meaning to read the Foundation Series, what should I read first?</p> | Missing!

Page 6: Why Search? (starring Elasticsearch)

Why not MySQL – SQL Like?

• SQL “LIKE” operator – scan all rows for a specific wildcard match

SELECT * FROM posts WHERE body LIKE "%darth vader%"

Every row is checked in turn (Match? Match? Match? Match?): the query performs a table scan.

Approx. 300 ms to search a measly 20K docs! (What if we had 20 million?)
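The table scan can be observed directly. The sketch below is illustrative only: it uses Python's built-in `sqlite3` module as a stand-in for MySQL (an assumption made so the example runs anywhere), loads the two sample posts, and asks the database for its query plan. The table and column names are hypothetical, and the exact wording of the plan varies between SQLite versions.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (post_id, body) VALUES (?, ?)",
    [
        (0, "<p>What exactly did Obiwan know about Anakin and Darth Vader "
            "before a New Hope started?</p>"),
        (1, "<p>Been meaning to read the Foundation Series, "
            "what should I read first?</p>"),
    ],
)

query = "SELECT post_id FROM posts WHERE body LIKE '%darth vader%'"

# SQLite's LIKE is case-insensitive for ASCII, so "Darth Vader" matches.
matches = [row[0] for row in conn.execute(query)]

# The last column of each EXPLAIN QUERY PLAN row describes the access path.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query))

print(matches)  # ids of the matching posts
print(plan)     # reports a full scan of the posts table
```

The plan reports a scan rather than an index search: with a leading wildcard, every row's body has to be inspected, so the cost grows with the number of rows.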

Page 7: Why Search? (starring Elasticsearch)

SQL Like – other problems

• Can’t search for words out of order:

SELECT * FROM posts WHERE body LIKE "%vader, darth%"
(0 results)

• Can’t search for alternate forms of a word:

SELECT * FROM posts WHERE body LIKE "%kittie pictures%"

SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"

Page 8: Why Search? (starring Elasticsearch)

SQL Like – other problems

• No Ranking of Results – given these two docs:

One might ask how none of the Jedi at Qui-Gon's funeral noticed that there was a Dark Lord of the Sith standing right behind them. Darth Vader and Obi-Wan only noticed each other when on the same station … It's apparently hard to pick up another force-user without knowing he or she is there…

I seem to remember a novel, I think it was Dark Lord: The Rise of Darth Vader, that addressed this. It made the assertion that while Darth Vader had lost both hands, he was still as formidable, in the force sense,

- Directly about Darth Vader

- Darth Vader is a side topic here

Which should come first?

Page 9: Why Search? (starring Elasticsearch)

SQL LIKE | CTRL+F | grep is

1. Extremely slow

2. Not fuzzy -- needs exact literal matches

3. Unranked -- simply says y/n whether there is a match

Page 10: Why Search? (starring Elasticsearch)

Search needs to be

1. FAST! A data structure that can efficiently take search terms and return a set of documents

2. FUZZY! A way to record positional and fuzzy modifications to text to assist matching

3. FRUITFUL! Relevant documents bubble to the top.


Page 11: Why Search? (starring Elasticsearch)

Let's play with an implementation

• Lucene -> Elasticsearch

[Figure: Lucene, Solr, Elasticsearch]

• Lucene, 1999, by Doug Cutting
o Java library for search

• Solr, 2006, Yonik Seeley
o First to put Lucene behind an HTTP interface
o Still going strong

• Elasticsearch, 2010, Shay Banon
o Alternative implementation
o Extremely REST-y

• Your database’s full-text search features
o MySQL, for example, has a FULLTEXT index
o Works for trivial cases, not the path of wisdom

Page 12: Why Search? (starring Elasticsearch)

Elasticsearch

• Create an index

curl -XPUT http://localhost:9200/stackexchange

• Index some docs!

curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
  "Body": "<p>Darth Vader dined with Luke</p>",
  "Title": "..."
}'

Page 13: Why Search? (starring Elasticsearch)

What is being built?

The answer can be found in your textbook…

Book index:
• Topics -> page no.
• Very efficient tool: compare to scanning the whole book!

Lucene uses an index:
• Tokens => document ids:

laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]
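The token-to-document-ids mapping above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Lucene's implementation; the function name and the sample documents are hypothetical, chosen only so that the resulting postings match the ones on the slide.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical documents, picked to reproduce the postings shown above.
docs = {
    0: "lightsaber",
    1: "lightsaber",
    2: "laser light",
    4: "laser",
    5: "light lightsaber",
    7: "lightsaber",
}

index = build_inverted_index(docs)
print(index["laser"])       # [2, 4]
print(index["light"])       # [2, 5]
print(index["lightsaber"])  # [0, 1, 5, 7]
```

Looking up a term is now a single dictionary access instead of a pass over every document, which is the same advantage a book index has over rereading the book.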

Page 14: Why Search? (starring Elasticsearch)

Computers == Dumb

• Humans are smart
o I see “cat” or “cats” in the back of a book, no duh: jump to page 9

• Computers are dumb
o “CAT” != “cat”: no match returned
o “cat” != “cats”: no match returned

• Hence, when indexing, normalize text to a more searchable form:

cats -> cat
fitted -> fit
alumnus -> alumnu
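A toy normalizer makes the point concrete. The rules below are a deliberately crude assumption (lowercase, strip a plural "s" or a past-tense "ed"), nowhere near a real stemmer, but they are enough to reproduce the three examples above and to show why both the indexed text and the query must pass through the same function.

```python
def normalize(token):
    """Toy normalizer: lowercase, then strip a plural "s" or a past-tense "ed"."""
    token = token.lower()
    if token.endswith("ed") and len(token) > 4:
        stem = token[:-2]
        # Undo consonant doubling: "fitt" -> "fit".
        if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
        return stem
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


# Raw comparison: the computer sees different strings.
print("CAT" == "cat")                         # False
# Normalized comparison: both sides reduce to the same form.
print(normalize("CAT") == normalize("cats"))  # True

for word in ["cats", "fitted", "alumnus"]:
    print(word, "->", normalize(word))
```

Note that "alumnu" is not a real word. That does not matter: the normalized form only has to be the same for every variant that should match.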


Page 15: Why Search? (starring Elasticsearch)

Normalization aka Text Analysis

• Raw input filtered (char filter)
o <p>Darth Vader dined with Luke</p>
o Darth Vader dined with Luke

• Tokenized
o Darth Vader dined with Luke
o [Darth] [Vader] [dined] [with] [Luke]

• Token filters (lowercased, synonyms applied, pointless words removed)
o [darth] [vader] [dine] [luke]

• Most importantly: this is highly configurable
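The three stages can be sketched as plain Python functions chained together. Everything here is an assumption made for illustration: the stop-word list is a tiny sample, the stemmer handles only the "-ed" case needed for this sentence, and synonyms are left out. Real analyzers are far more thorough, and in Elasticsearch each stage is chosen through configuration rather than code.

```python
import re

STOP_WORDS = {"a", "an", "and", "of", "the", "with"}  # tiny illustrative list
VOWELS = set("aeiou")


def char_filter(raw):
    """Character filter: replace HTML tags with spaces."""
    return re.sub(r"<[^>]+>", " ", raw)


def tokenize(text):
    """Tokenizer: split into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def toy_stem(token):
    """Toy stemmer: strip "ed", restoring a final "e" for short stems (din -> dine)."""
    if token.endswith("ed") and len(token) > 4:
        stem = token[:-2]
        is_cvc = (len(stem) == 3 and stem[0] not in VOWELS
                  and stem[1] in VOWELS and stem[2] not in VOWELS)
        return stem + "e" if is_cvc else stem
    return token


def analyze(raw):
    """Run the char filter, the tokenizer, then the token filters in order."""
    tokens = tokenize(char_filter(raw))
    tokens = [t.lower() for t in tokens]                 # lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word filter
    return [toy_stem(t) for t in tokens]                 # stemming filter


print(analyze("<p>Darth Vader dined with Luke</p>"))
# ['darth', 'vader', 'dine', 'luke']
```

Because each stage is a separate function, swapping one out (a different tokenizer, a longer stop-word list) leaves the rest untouched, which mirrors the configurability the slide emphasizes.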


Page 16: Why Search? (starring Elasticsearch)

Normalization aka Text Analysis

curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'

{
  "tokens": [
    { "end_offset": 5, "position": 1, "start_offset": 0, "token": "darth", "type": "<ALPHANUM>" },
    { "end_offset": 11, "position": 2, "start_offset": 6, "token": "vader", "type": "<ALPHANUM>" },
    { "end_offset": 17, "position": 3, "start_offset": 12, "token": "dine", "type": "<ALPHANUM>" },
    { "end_offset": 27, "position": 5, "start_offset": 23, "token": "luke", "type": "<ALPHANUM>" }
  ]
}
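To read this response, it helps to map each token back onto the input. The sketch below copies the offsets and positions from the output above into a Python list (no server is contacted) and uses them to slice the original string. It also shows where position 4 went: that slot belonged to "with", which the analyzer dropped.

```python
text = "Darth Vader dined with Luke"

# Offsets and positions copied from the _analyze response above.
tokens = [
    {"token": "darth", "start_offset": 0, "end_offset": 5, "position": 1},
    {"token": "vader", "start_offset": 6, "end_offset": 11, "position": 2},
    {"token": "dine", "start_offset": 12, "end_offset": 17, "position": 3},
    {"token": "luke", "start_offset": 23, "end_offset": 27, "position": 5},
]

# The offsets point back at the original surface text, before normalization.
surface = [text[t["start_offset"]:t["end_offset"]] for t in tokens]
print(surface)  # ['Darth', 'Vader', 'dined', 'Luke']

# Positions are 1-based here; any gap marks a removed word.
words = text.split()
positions = {t["position"] for t in tokens}
removed = [words[p - 1] for p in range(1, len(words) + 1) if p not in positions]
print(removed)  # ['with']
```

The indexed token ("dine") and the surface text ("dined") differ, while the offsets still identify the exact span in the source.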

Page 17: Why Search? (starring Elasticsearch)

What is being built?

field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
  "Body": "<p>Darth Vader dined with Luke</p>",
  "Title": "..."
}'

curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
  "Body": "<p>We love Darth</p>",
  "Title": "..."
}'

Page 18: Why Search? (starring Elasticsearch)

Ranking

field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
  "Body": "<p>Darth Vader dined with Luke</p>",
  "Title": "..."
}'

curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
  "Body": "<p>We love Darth</p>",
  "Title": "..."
}'

Can we store anything here (in <metadata>) to help decide how relevant this term is for this doc?

Yes!
- Term frequency
  - How much “darth” is in this doc?
- Position within document
  - Helps when we search for the phrase “darth vader”
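One way to picture that metadata is a posting list that stores, for each term and document, the positions where the term occurs. Term frequency is then simply the number of positions. The sketch below is an illustration under assumptions, not Lucene's on-disk format: the function names are hypothetical, and the token lists are the assumed result of analyzing the two documents above.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} from already-analyzed token lists."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index


def term_frequency(index, term, doc_id):
    """How many times `term` occurs in `doc_id`."""
    return len(index.get(term, {}).get(doc_id, []))


def phrase_match(index, first, second):
    """Ids of docs where `second` appears immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)


# Assumed analysis output for the two indexed documents.
docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}

index = build_positional_index(docs)
print(term_frequency(index, "darth", 1))      # 1
print(phrase_match(index, "darth", "vader"))  # [1]
```

Both documents contain "darth", but only document 1 has "vader" in the very next position, so only it satisfies the phrase query.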

Page 19: Why Search? (starring Elasticsearch)

Query Documents

• When did Darth Vader and Luke have dinner?

curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '{
  "query": {
    "match": {
      "Body": "luke darth dinner"
    }
  }
}'

User query: "luke darth dinner"

Page 20: Why Search? (starring Elasticsearch)

What happens when we query?

How to consult index for matches?

1. User query: luke darth dinner
2. Analysis: [luke] [darth] [dine]
3. Look up each term in the index: [darth], [dine], ...
4. Score for [darth] docs (1 and 2)
5. Score for [dine] docs (1)
6. Return sorted docs to client

field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
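The lookup-score-sort flow can be sketched as follows. The scoring rule here, summing raw term frequencies, is a simplifying assumption for illustration and is not Lucene's actual formula. The index is written out by hand from the two documents indexed earlier, with a "luke" entry added because the diagram abbreviates the term list.

```python
from collections import defaultdict

# term -> {doc_id: term frequency}, hand-built from the two indexed documents.
INDEX = {
    "darth": {1: 1, 2: 1},
    "vader": {1: 1},
    "dine": {1: 1},
    "luke": {1: 1},
}


def search(index, query_terms):
    """Score each doc by summed term frequency; return (doc_id, score), best first."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Query terms as they appear after analysis in the diagram above.
results = search(INDEX, ["luke", "darth", "dine"])
print(results)  # [(1, 3), (2, 1)]
```

Document 1 matches all three terms and ranks first; document 2 matches only "darth" and still appears, just lower. Only the posting lists for the query terms are touched, never the documents themselves.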

Page 21: Why Search? (starring Elasticsearch)

So Elasticsearch!

• FAST!
o Inverted index data structure is blazing fast
o Lucene is probably the most tuned implementation

• FUZZY!
o We use analysis to normalize text to canonical forms
o We can use positional information when querying (not shown here)

• FRUITFUL!
o Relevant documents are scored based on relative term frequency

Page 22: Why Search? (starring Elasticsearch)

BUT WAIT THERE’S MORE

• Many non-traditional applications of “search”
o Rank file directory by proximity to current directory
o Geographic-aided search, rank based on distance and search relevancy
o Q & A systems: Watson has a ton of Lucene
o Log aggregation, e.g. Kibana -- because in Lucene everything is indexed!

• And many features!
o Spellchecking
o Facets
o More-like-this document

Page 23: Why Search? (starring Elasticsearch)

QUESTIONS?
