Architecture of a Search Engine

Paris Tech Talks #7 - April ’14 @sylvainutard - @algolia

Search

• Today, search means Google
• Search is a daily activity
• Search is complex
• Databases are (probably) not built to handle text queries
• Speed and relevance are key
• Fuzzy matching: typos!

Why Search engines?

• Databases
  • Optimized for INSERT/UPDATE/DELETE/SELECT (that's a lot)
  • Strong query syntax (mostly SQL)
  • Some operations scan all your documents (missing index?)

Why Search engines?

• Search engines
  • HIGHLY optimized for “SELECT” (only)
  • Full-text queries: they understand what a word is
  • Query execution time is driven by the number of matching documents
  • And obviously, “LIKE '%foo bar%'” is not full-text search
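To illustrate why “LIKE '%foo bar%'” is not full-text search, here is a minimal Python sketch that contrasts a plain substring test (roughly what LIKE does) with a word-based match. The helper names (`tokenize`, `like_match`, `matches_all_words`) and the deliberately naive tokenizer are assumptions made for illustration, not part of any particular engine.

```python
import re

def tokenize(text):
    """Split text into lowercase words (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())

def like_match(text, pattern):
    """Rough equivalent of SQL LIKE '%pattern%': a plain substring test."""
    return pattern.lower() in text.lower()

def matches_all_words(text, query):
    """Word-based match: every query word must appear as a word in the text."""
    words = set(tokenize(text))
    return all(word in words for word in tokenize(query))

for doc in ["bar foo", "foo bar baz", "foo barn"]:
    print(doc, like_match(doc, "foo bar"), matches_all_words(doc, "foo bar"))
# bar foo False True
# foo bar baz True True
# foo barn True False
```

The substring test misses “bar foo” (same words, different order) and wrongly accepts “foo barn”, because it has no notion of what a word is.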

Why Search engines?

[Diagram: the Application, the Primary storage (DB, files, ...) and the Search engine. Arrow labels: “Push data periodically or in realtime”, “Full-text search”, “Search”.]

How it works

• Input = documents
  • Composed of multiple attributes (textual, numerical, geo)
• Output = documents
  • Full-text query and/or numerical filters
  • Understandable results: match score (ranking) + highlighting
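Highlighting is what makes a result “understandable”: it shows the user which words matched. A minimal sketch, assuming a hypothetical `highlight` helper and `<em>` tags as markers (both are illustrative choices, not taken from the talk):

```python
import re

def highlight(text, query_words, pre="<em>", post="</em>"):
    """Wrap every word of the text that is also a query word with pre/post."""
    wanted = {word.lower() for word in query_words}

    def wrap(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if word.lower() in wanted else word

    # Visit each word of the text and wrap it only if it was queried.
    return re.sub(r"\w+", wrap, text)

print(highlight("Foo bar baz", ["foo", "baz"]))
# <em>Foo</em> bar <em>baz</em>
```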

Implementation

• 2 distinct processes
  • Indexing: storing documents in a highly optimized way to answer queries
  • Query
    • Matching documents
    • Ranking matched documents

Implementation: Indexing process

• Indexing means building an “index” or “inverted lists”
  • A dedicated data structure optimized for search
  • Input = a set of documents containing words
  • Output = a set of words associated with documents

Implementation: Indexing process

Documents:

Doc 1: foo bar baz
Doc 2: bar foo
Doc 3: baz baz qux

Index (inverted lists) produced by indexing:

foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
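The indexing step on this slide can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace tokenization and integer document ids; `build_inverted_index` is a hypothetical name, and a real engine stores its lists far more compactly.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)  # a set ignores repeated words ("baz baz")
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}

documents = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}
index = build_inverted_index(documents)
for word in sorted(index):
    print(word, index[word])
# bar [1, 2]
# baz [1, 3]
# foo [1, 2]
# qux [3]
```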

Implementation: Query process

• Queries
  • Goal = retrieve all documents matching a user query
  • Order results from the highest ranked to the lowest

• 1-word query = a single inverted list lookup

[Diagram: user query "baz" → look up the inverted list of "baz" in the index (Doc 1, Doc 3) → sort matching documents → pagination]
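A one-word query boils down to three steps: look up the word's inverted list, sort the matching documents, and cut out the requested page. The sketch below assumes the example index from the indexing slide; the `SCORES` values and the `search_one_word` name are made up for illustration.

```python
# Inverted lists from the example documents (word -> document ids).
INDEX = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}
# Hypothetical ranking scores, invented for this example (higher is better).
SCORES = {1: 0.4, 2: 0.7, 3: 0.9}

def search_one_word(word, page=0, hits_per_page=10):
    """Look up one word, sort its documents by score, return one page."""
    matching = INDEX.get(word, [])
    ranked = sorted(matching, key=lambda doc_id: SCORES[doc_id], reverse=True)
    start = page * hits_per_page
    return ranked[start:start + hits_per_page]

print(search_one_word("baz"))                           # [3, 1]
print(search_one_word("baz", page=1, hits_per_page=1))  # [1]
```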


• N-word query = inverted lists intersection

[Diagram: user query "baz qux" → intersect the inverted lists of "baz" (Doc 1, Doc 3) and "qux" (Doc 3), which leaves Doc 3 → sort matching documents → pagination]
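For a query of several words, the matching documents are those present in every word's inverted list. A minimal sketch, assuming the lists are kept sorted by document id so that two of them can be intersected with a linear merge; the `intersect` and `search` names are hypothetical.

```python
# Inverted lists from the example documents, sorted by document id.
INDEX = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}

def intersect(list_a, list_b):
    """Intersect two sorted lists of document ids with a linear merge."""
    result, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result

def search(query):
    """Return the ids of the documents containing every word of the query."""
    lists = [INDEX.get(word, []) for word in query.split()]
    if not lists:
        return []
    lists.sort(key=len)  # start with the shortest list to keep results small
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result

print(search("baz qux"))  # [3]
print(search("foo bar"))  # [1, 2]
print(search("foo qux"))  # []
```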

• But how do you handle typing mistakes?
  • Edit-distance algorithms (e.g. Levenshtein)
    • levenshtein(bar, baz) = 1 (substitution)
    • levenshtein(bar, br) = 1 (deletion)
    • levenshtein(bar, foobar) = 3 (addition)
  • Comparing a word with all known words would be too costly
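The three distances above can be checked with the classic dynamic-programming formulation of the Levenshtein distance. This is a textbook sketch that keeps only two rows of the table, not the code of any specific engine.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("bar", "baz"))     # 1
print(levenshtein("bar", "br"))      # 1
print(levenshtein("bar", "foobar"))  # 3
```

Each call costs time proportional to the product of the two word lengths, which is why running it against every known word does not scale.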

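• The word dictionary is stored in a trie to enable Levenshtein-based lookups (recursive traversal)

[Diagram: the index as a trie with one letter per node (b, c, a, o, r, z, o, f); nodes that end a word point to that word's inverted list with word positions, e.g. Doc 1 (pos=1, 3), Doc 2 (pos=3) and Doc 1 (pos=2), Doc 3 (pos=1)]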

Example: faz

[Diagram: the same trie traversed for the query "faz"; each visited node is annotated "faz (distance=d)", with d ranging from 0 to 3]
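One common way to implement such a lookup is to walk the trie recursively while carrying one row of the Levenshtein table per node, and to abandon a branch as soon as no word below it can stay within the allowed distance. The sketch below follows that idea; the class and function names are hypothetical, and the dictionary (foo, bar, baz, qux) is borrowed from the earlier example documents rather than from the trie drawn on the slide.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.word = None    # set on nodes that end a dictionary word

def insert(root, word):
    node = root
    for char in word:
        node = node.children.setdefault(char, TrieNode())
    node.word = word

def fuzzy_search(root, query, max_distance):
    """Return sorted (word, distance) pairs within max_distance of query."""
    results = []
    first_row = list(range(len(query) + 1))
    for char, child in root.children.items():
        _walk(child, char, query, first_row, max_distance, results)
    return sorted(results)

def _walk(node, char, query, previous_row, max_distance, results):
    # Levenshtein row for the prefix spelled by the path down to this node.
    row = [previous_row[0] + 1]
    for j in range(1, len(query) + 1):
        cost = 0 if query[j - 1] == char else 1
        row.append(min(row[j - 1] + 1,
                       previous_row[j] + 1,
                       previous_row[j - 1] + cost))
    if node.word is not None and row[-1] <= max_distance:
        results.append((node.word, row[-1]))
    # Prune: no word below this node can be closer than min(row).
    if min(row) <= max_distance:
        for next_char, child in node.children.items():
            _walk(child, next_char, query, row, max_distance, results)

root = TrieNode()
for word in ["foo", "bar", "baz", "qux"]:
    insert(root, word)

print(fuzzy_search(root, "faz", max_distance=1))  # [('baz', 1)]
print(fuzzy_search(root, "faz", max_distance=2))
# [('bar', 2), ('baz', 1), ('foo', 2)]
```

Words sharing a prefix share the rows computed for that prefix, and whole subtrees are skipped once the distance budget is exceeded, which avoids comparing the query with every known word.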

• How are the matching documents ranked?
  • Number of match occurrences? TF-IDF?
  • Numerical value reflecting popularity?
  • Number of typing mistakes?
  • Proximity between matched words?
  • …
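These criteria can be combined in many ways. One simple possibility, shown here only as an illustration, is to compare them one after the other through a tuple sort key. The `Match` fields, their order of precedence and the sample values are all assumptions, not a description of how any given engine ranks.

```python
from dataclasses import dataclass

@dataclass
class Match:
    doc_id: int
    typos: int       # number of typing mistakes tolerated to match
    proximity: int   # distance between matched words (lower is closer)
    popularity: int  # numerical value attached to the document

def rank(matches):
    """Sort by fewest typos, then closest words, then highest popularity."""
    return sorted(matches, key=lambda m: (m.typos, m.proximity, -m.popularity))

matches = [
    Match(doc_id=1, typos=1, proximity=1, popularity=50),
    Match(doc_id=2, typos=0, proximity=2, popularity=10),
    Match(doc_id=3, typos=0, proximity=1, popularity=5),
    Match(doc_id=4, typos=0, proximity=1, popularity=90),
]
print([m.doc_id for m in rank(matches)])  # [4, 3, 2, 1]
```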

Several implementations

Missing subjects

• What I didn’t speak about:
  • Numerical/geo queries (including operators)
  • Advanced query syntax (boolean operators, proximity operators)
  • Faceting & aggregations (categorization)
  • Sharding (horizontal scalability)
  • Incremental indexing (generational data structures)
  • … (see you next time)

Q/A

Now or later: [email protected]
