architecture of a search engine

Architecture of a Search Engine - Paris Tech Talks #7 - April 2014 - @sylvainutard - @algolia

Upload: sylvain-utard

Post on 19-Jun-2015


DESCRIPTION

My Paris Tech Talk #7 slides, April 2014. Architecture of a search engine, full-text search from my technical point of view.

TRANSCRIPT

Page 1: Architecture of a search engine

Architecture of a Search Engine

Paris Tech Talks #7 - April ’14 @sylvainutard - @algolia

Page 2: Architecture of a search engine

Search

• Today, search means Google

• Search is a daily activity

• Search is complex

• Databases are (probably) not built to handle full-text queries

• Speed and relevance are key

• Fuzzy matching: typos!

Page 3: Architecture of a search engine

Why search engines?

• Databases

  • Optimized for INSERT/UPDATE/DELETE/SELECT (that's a lot)

  • Strong query syntax (mostly SQL)

  • Some operations scan all your documents (missing index?)

Page 4: Architecture of a search engine

Why search engines?

• Search engines

  • HIGHLY optimized for “SELECT” (only)

  • Full-text queries: they understand what a word is

  • Query execution time is driven by the number of matching documents

  • And obviously, “LIKE '%foo bar%'” is not full-text search
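To illustrate why a substring pattern is not full-text search, here is a minimal Python sketch. The helper names (`tokenize`, `like_match`, `word_match`) are hypothetical, and `like_match` is only a rough stand-in for `LIKE '%...%'` (real databases differ on case sensitivity and collation).

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words (runs of letters/digits)."""
    return re.findall(r"\w+", text.lower())

def like_match(text, pattern):
    """Rough stand-in for SQL LIKE '%pattern%': a plain substring test."""
    return pattern.lower() in text.lower()

def word_match(text, query):
    """True when every query word appears as a whole word in the text."""
    words = set(tokenize(text))
    return all(term in words for term in tokenize(query))

docs = ["foo bar baz", "bar foo", "foobar"]
print([like_match(d, "foo bar") for d in docs])  # [True, False, False]
print([word_match(d, "foo bar") for d in docs])  # [True, True, False]
```

The substring test misses "bar foo" because the word order differs, while the word-based test finds it and still rejects "foobar", which is a different word.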

Page 5: Architecture of a search engine

Why search engines?

[Diagram. Boxes: Application, Search engine, Primary storage (DB, files, ...). Arrow labels: "Search", "Full-text search", "Push data periodically or in real time".]

Page 6: Architecture of a search engine

How it works

• Input = documents

  • Composed of multiple attributes (textual, numerical, geo)

• Output = documents

  • Full-text query and/or numerical filters

  • Understandable results: match score (ranking) + highlighting

Page 7: Architecture of a search engine

Implementation

• 2 distinct processes

  • Indexing: storing documents in a form that is highly optimized for answering queries

  • Query

    • Matching documents

    • Ranking matched documents

Page 8: Architecture of a search engine

Implementation: Indexing process

• Indexing means building an “index” or “inverted lists”

  • A dedicated data structure optimized for search

  • Input = a set of documents containing words

  • Output = a set of words associated with documents

Page 9: Architecture of a search engine

Implementation: Indexing process

Documents:

Doc 1: foo bar baz
Doc 2: bar foo
Doc 3: baz baz qux

Index (inverted lists) produced by indexing:

foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
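The indexing step on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and integer document ids; `build_index` is a hypothetical helper name, not the API of any particular engine.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            postings[word].add(doc_id)  # a set ignores repeated words ("baz baz")
    return {word: sorted(ids) for word, ids in postings.items()}

# The three documents from the slide.
documents = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}
index = build_index(documents)
for word in sorted(index):
    print(word, index[word])
# bar [1, 2]
# baz [1, 3]
# foo [1, 2]
# qux [3]
```

Keeping each list sorted by document id is a common choice because it makes the later intersection step a simple merge.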

Page 10: Architecture of a search engine

Implementation: Query process

• Queries

  • Goal = retrieve all documents matching a user query

  • Order results from the highest ranked to the lowest

Page 11: Architecture of a search engine

Implementation: Query process

• 1-word query = inverted list lookup

Index (inverted lists):

foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3

Steps for the user query "baz": look up the inverted list of "baz" (Doc 1, Doc 3), sort the matching documents, then paginate.
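A single-word query reads one inverted list, sorts it and returns one page. The sketch below assumes a hypothetical per-document `scores` table (for example a popularity value) as the sort key; the slide does not say which ranking is used at this stage.

```python
def search_one_word(index, scores, word, page=0, hits_per_page=2):
    """Look up one inverted list, sort it by score, and return one page of ids."""
    matches = index.get(word, [])
    ranked = sorted(matches, key=lambda doc_id: scores[doc_id], reverse=True)
    start = page * hits_per_page
    return ranked[start:start + hits_per_page]

index = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}
scores = {1: 10, 2: 50, 3: 30}  # hypothetical popularity values, not from the talk

print(search_one_word(index, scores, "baz"))          # [3, 1]
print(search_one_word(index, scores, "baz", page=1))  # []
print(search_one_word(index, scores, "nope"))         # []
```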

Page 12: Architecture of a search engine

Implementation: Query process

• N-word query = inverted lists intersection

Index (inverted lists):

foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3

Steps for the user query "baz qux": intersect the inverted lists of "baz" (Doc 1, Doc 3) and "qux" (Doc 3), which leaves Doc 3, sort the matching documents, then paginate.
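The intersection step can be sketched as a two-pointer merge over sorted lists of document ids. This is one common way to do it, not necessarily what any given engine implements; `intersect` and `search` are hypothetical names.

```python
def intersect(a, b):
    """Intersect two sorted lists of document ids with a two-pointer merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(index, query):
    """Return the ids of documents containing every word of the query."""
    lists = [index.get(word, []) for word in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)  # start from the shortest list to keep intermediates small
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result

index = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}
print(search(index, "baz qux"))  # [3]
print(search(index, "foo bar"))  # [1, 2]
print(search(index, "foo qux"))  # []
```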

Page 13: Architecture of a search engine

Implementation: Query process

• But how do you handle typing mistakes?

  • Edit-distance algorithms (e.g. Levenshtein)

    • levenshtein(bar, baz) = 1 (substitution)

    • levenshtein(bar, br) = 1 (deletion)

    • levenshtein(bar, foobar) = 3 (additions)

  • Comparing a word with all known words would be too costly
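For reference, the Levenshtein distance used in these examples can be computed with the classic dynamic-programming recurrence. The sketch below keeps only two rows of the table; it is a textbook version, shown to make the three examples checkable.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, deletions and additions turning a into b."""
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # add char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or free match
            ))
        previous = current
    return previous[-1]

print(levenshtein("bar", "baz"))     # 1
print(levenshtein("bar", "br"))      # 1
print(levenshtein("bar", "foobar"))  # 3
```

Each call costs time proportional to the product of the two word lengths, which is why running it against every word of the dictionary for every query does not scale.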

Page 14: Architecture of a search engine

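Implementation: Query process

• The word dictionary is stored in a trie to enable Levenshtein-based lookups (recursive traversal)

[Figure: a trie over the word dictionary (node letters visible in the extraction: b, c, a, o, r, z, o, f). Each word's final node points into the index at a list of documents with word positions: Doc 1 (pos=1, 3), Doc 2 (pos=3); Doc 1 (pos=2), Doc 3 (pos=1); Doc 1 (pos=4), Doc 3 (pos=2).]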
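A trie whose word-ending nodes carry documents and positions can be sketched as follows. The class and function names are hypothetical, positions are assumed to be 1-based word offsets, and the sample documents are the three from the indexing slide, so the positions differ from those in the figure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.postings = {}  # doc id -> word positions (filled on word-ending nodes)

def add_word(root, word, doc_id, position):
    """Walk or create the path for `word`, then record where it occurred."""
    node = root
    for letter in word:
        node = node.children.setdefault(letter, TrieNode())
    node.postings.setdefault(doc_id, []).append(position)

def build_trie(documents):
    root = TrieNode()
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            add_word(root, word, doc_id, position)
    return root

def lookup(root, word):
    """Exact lookup: follow one letter per level, return the postings found."""
    node = root
    for letter in word:
        node = node.children.get(letter)
        if node is None:
            return {}
    return node.postings

documents = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}
root = build_trie(documents)
print(lookup(root, "baz"))  # {1: [3], 3: [1, 2]}
print(lookup(root, "ba"))   # {} (a prefix only, not a stored word)
```

Words sharing a prefix ("bar", "baz") share the same path, which is what lets a fuzzy lookup reuse work across many dictionary words.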

Page 15: Architecture of a search engine

Implementation: Query process

Example: faz

[Figure: the trie from the previous slide, traversed for the query "faz". Each visited node is annotated with an edit distance for "faz"; the annotations range from distance=0 to distance=3.]
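The traversal on this slide can be sketched with a well-known technique: carry one row of the Levenshtein table down each trie branch and abandon a branch once every cell in the row exceeds the allowed distance. This is an illustration under assumptions (a nested-dict trie, a `"$"` end-of-word marker, words that never contain `"$"`), not the talk's actual implementation.

```python
def build_trie(words):
    """Nested-dict trie; the "$" key marks the end of a word and stores it."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = word
    return root

def fuzzy_search(root, query, max_distance):
    """Return {word: distance} for stored words within max_distance of query."""
    results = {}
    first_row = list(range(len(query) + 1))
    for letter, child in root.items():
        if letter != "$":
            _walk(child, letter, query, first_row, max_distance, results)
    return results

def _walk(node, letter, query, previous_row, max_distance, results):
    # Extend the Levenshtein table by one row for the letter just consumed.
    row = [previous_row[0] + 1]
    for col in range(1, len(query) + 1):
        row.append(min(
            row[col - 1] + 1,                                   # addition
            previous_row[col] + 1,                              # deletion
            previous_row[col - 1] + (query[col - 1] != letter)  # substitution
        ))
    if "$" in node and row[-1] <= max_distance:
        results[node["$"]] = row[-1]
    if min(row) <= max_distance:  # otherwise no deeper word can be close enough
        for next_letter, child in node.items():
            if next_letter != "$":
                _walk(child, next_letter, query, row, max_distance, results)

trie = build_trie(["foo", "bar", "baz", "qux"])
print(fuzzy_search(trie, "faz", 1))  # {'baz': 1}
```

With `max_distance=2` the same query also returns "foo" and "bar" at distance 2, while the "qux" branch is abandoned before its last letter.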

Page 16: Architecture of a search engine

Implementation: Query process

• How are the matching documents ranked?

  • Number of match occurrences? TF-IDF?

  • Numerical value reflecting popularity?

  • Number of typing mistakes?

  • Proximity between matched words?

  • …
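As one example of the options listed here, a TF-IDF style score can be sketched as below. The exact formula is an assumption (raw term frequency times `log(N / df) + 1`); engines use many variants, and the slide does not commit to one. `tf_idf_scores` is a hypothetical name, and the function scores every document matching at least one query word.

```python
import math
from collections import Counter

def tf_idf_scores(documents, query):
    """Rank documents by the sum of tf * idf over the query words they contain."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for word in query.lower().split():
        containing = [d for d, words in tokenized.items() if word in words]
        if not containing:
            continue
        # Rare words weigh more; the +1 is an assumed smoothing choice.
        idf = math.log(n_docs / len(containing)) + 1.0
        for doc_id in containing:
            tf = Counter(tokenized[doc_id])[word]  # occurrences in this document
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

documents = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}
print([doc_id for doc_id, _ in tf_idf_scores(documents, "baz")])  # [3, 1]
```

Doc 3 comes first because it contains "baz" twice. The other criteria on the slide (popularity, number of typos, proximity) could be combined in the same way, for instance as a tuple sort key.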

Page 17: Architecture of a search engine

Several implementations

Page 18: Architecture of a search engine

Missing subjects

• What I didn't talk about:

  • Numerical/geo queries (including operators)

  • Advanced query syntax (boolean operators, proximity operators)

  • Faceting & aggregations (categorization)

  • Sharding (horizontal scalability)

  • Incremental indexing (generational data structures)

  • … (see you next time)