architecture of a search engine
DESCRIPTION
My Paris Tech Talk #7 slides, April 2014. Architecture of a search engine, full-text search from my technical point of view.TRANSCRIPT
Architecture of a Search Engine
Paris Tech Talks #7 - April ’14 @sylvainutard - @algolia
• Today Search means Google
• Search is a daily activity
• Search is complex
• DB are (probably) not handling text queries
• Speed and relevance are keys
• Fuzzy matching: typos!
2
Search
• Databases
• Optimized for INSERT/UPDATE/DELETE/SELECT (that's a lot)
• Strong query syntax (mostly SQL)
• Some operations scan all your documents (missing index?)
3
Why Search engines?
• Search engines
• HIGHLY optimized for “SELECT” (only)
• Full-text queries: understand what is a word
• Query execution time driven by the number of matching documents
• And obviously, “LIKE '%foo bar%’" is not full-text search
4
Why Search engines?
5
Why Search engines?
Search
Push data periodically or
in realtime
Full-text search
Primary storage(DB, files, ...)
Search engine
Application
• Input = documents
• Composed by multiple attributes (textual, numerical, geo)
• Output = documents
• Full-text query and/or numerical filters
• Understandable results: match score (ranking) + highlighting
6
How it works
• 2 distinct processes
• Indexing: storing documents in a highly optimized way to answer queries
• Query
• Matching documents
• Ranking matched documents
7
Implementation
• Indexing means building an “index“ or “inverted lists“
• A dedicated data structure optimized for search
• Input = a set of documents containing words
• Output = a set of words associated to documents
8
Implementation: Indexing process
9
Implementation: Indexing process
foo bar baz
Doc 1
bar foo
Doc 2
baz baz qux
Doc 3
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3Indexing
Inverted lists
Documents Index
• Queries
• Goal = Retrieve all documents matching a user query
• Order results from the highest ranked to the lowest
10
Implementation: Query process
11
Implementation: Query process
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3
Inverted lists
Index
User query "baz"
Sort matching documents
Pagination
• 1-word query = inverted lists intersection
12
Implementation: Query process
• N-words query = inverted lists intersection
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3
Inverted lists
Index
User query "baz qux"
Sort matching documents
Intersect inverted lists
Pagination
• But how do you handle typing mistakes?
• Edit-distance algorithms (ex: Levenshtein) !
• levenshtein(bar, baz) = 1 (substitution)!• levenshtein(bar, br) = 1 (deletion)!• levenshtein(bar, foobar) = 3 (addition)!
• Comparing a word with all known words would be too costly
13
Implementation: Query process
14
Implementation: Query process
• The words dictionary is stored in a TRIE to enable Levenshtein-based lookups (recursive-based traversal)
Doc 1 (pos=1, 3)Doc 2 (pos=3)
Doc 1 (pos=2)Doc 3 (pos=1)
Index
Doc 1 (pos=4)Doc 3 (pos=2)
b c
a o
r z o
f
15
Implementation: Query process
Example: faz
Doc 1 (pos=1, 3)Doc 2 (pos=3)
Doc 1 (pos=2)Doc 3 (pos=1)
Index
Doc 1 (pos=4)Doc 3 (pos=2)
b c
a o
r z o
ffaz (distance=1)
faz (distance=0)faz (distance=1)
faz (distance=1)
faz (distance=2) faz (distance=1)
faz (distance=2)
faz (distance=3)
• How are the matching documents ranked?
• Number of match occurrences? TF-IDF ?
• Numerical value reflecting popularity?
• Number of typing mistakes?
• Proximity between matched words?
• …
16
Implementation: Query process
17
Several implementations
• What I didn’t speak about:
• Numerical/Geo queries (Including operators)
• Advanced query syntax (boolean operators, proximity operators)
• Faceting & Aggregations (Categorization)
• Sharding (Horizontal scalability)
• Incremental indexing (Generational data structures)
• … (see u next time)
18
Missing subjects
Q/ANow or later [email protected]