PostgreSQL search demystified
DESCRIPTION
How does a full-text search engine work? How is the index built and searched? Can I use PostgreSQL as a full-text search engine, or should I go for a more specialised solution? How does one configure and use PostgreSQL search? This presentation covers all those aspects, based on the work we did to index teowaki.com. It was presented at PgConf EU 2014 in Madrid.
TRANSCRIPT
![Page 1: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/1.jpg)
PgConf EU 2014 presents
Javier Ramirez
PostgreSQL
Full-text search
demystified
@supercoco9
https://teowaki.com
![Page 2: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/2.jpg)
The problem
![Page 3: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/3.jpg)
our architecture
![Page 4: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/4.jpg)
![Page 5: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/5.jpg)
One does not simply
SELECT * from stuff where
content ilike '%postgresql%'
![Page 6: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/6.jpg)
![Page 7: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/7.jpg)
![Page 8: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/8.jpg)
Basic search features
* stemmers (run, runner, running)
* unaccented (josé, jose)
* results highlighting
* rank results by relevance
![Page 9: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/9.jpg)
Nice to have features
* partial searches
* search operators (OR, AND...)
* synonyms (postgres, postgresql, pgsql)
* thesaurus (OS=Operating System)
* fast, and space-efficient
* debugging
![Page 10: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/10.jpg)
Good News:
PostgreSQL supports all
the requested features
![Page 11: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/11.jpg)
Bad News:
unless you already know about search
engines, the official docs are not obvious
![Page 12: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/12.jpg)
How a search engine works
* An indexing phase
* A search phase
![Page 13: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/13.jpg)
The indexing phase
Convert the input text to tokens
![Page 14: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/14.jpg)
The search phase
Match the search terms to
the indexed tokens
![Page 15: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/15.jpg)
indexing in depth
* choose an index format
* tokenize the words
* apply token analysis/filters
* discard unwanted tokens
![Page 16: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/16.jpg)
the index format
* r-tree (GiST in PostgreSQL)
* inverted indexes (GIN in PostgreSQL)
* dynamic/distributed indexes
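To make the inverted index idea concrete, here is a minimal sketch in Python. It is not how GIN is implemented; it only shows the basic shape, a map from each token to the documents that contain it. The sample documents and the function name `build_inverted_index` are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer: lowercase and split on whitespace.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


# Hypothetical sample documents, keyed by id.
docs = {
    1: "PostgreSQL full text search",
    2: "search engines explained",
    3: "PostgreSQL indexes",
}
index = build_inverted_index(docs)

print(sorted(index["search"]))      # ids of documents containing "search"
print(sorted(index["postgresql"]))  # ids of documents containing "postgresql"
```

Searching then becomes a dictionary lookup instead of a scan over every document, which is the main reason this layout is popular for text search.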
![Page 17: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/17.jpg)
dynamic indexes: segmentation
* sometimes the token index is
segmented to allow faster updates
* consolidate segments to speed up
search and account for deletions
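The segmentation idea can be sketched as follows. This is a toy model under simple assumptions (one segment per added document, deletions recorded as tombstones), not the design of any particular engine; the class `SegmentedIndex` and its methods are hypothetical names.

```python
class SegmentedIndex:
    """Toy segmented index: cheap appends, periodic consolidation."""

    def __init__(self):
        self.segments = []    # each segment maps token -> set of doc ids
        self.deleted = set()  # tombstones for deleted doc ids

    def add_document(self, doc_id, text):
        # Writing a new small segment avoids rewriting the existing ones.
        segment = {}
        for token in text.lower().split():
            segment.setdefault(token, set()).add(doc_id)
        self.segments.append(segment)

    def delete_document(self, doc_id):
        # Deletion only records a tombstone; segments are left untouched.
        self.deleted.add(doc_id)

    def search(self, token):
        # Every segment has to be visited, so many segments mean slower search.
        hits = set()
        for segment in self.segments:
            hits |= segment.get(token, set())
        return hits - self.deleted

    def consolidate(self):
        # Merge all segments into one and physically drop deleted documents.
        merged = {}
        for segment in self.segments:
            for token, ids in segment.items():
                live = ids - self.deleted
                if live:
                    merged.setdefault(token, set()).update(live)
        self.segments = [merged]
        self.deleted.clear()


idx = SegmentedIndex()
idx.add_document(1, "postgres search")
idx.add_document(2, "postgres index")
idx.delete_document(1)
print(idx.search("postgres"), len(idx.segments))
idx.consolidate()
print(idx.search("postgres"), len(idx.segments))
```

The trade-off is visible in `search`: updates stay cheap because they only append, while lookups pay for every segment until `consolidate` runs.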
![Page 18: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/18.jpg)
tokenizing
* parse/strip/convert format
* normalize terms (unaccent, ascii,
charsets, case folding, number precision...)
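A rough sketch of these two steps using only the Python standard library: strip the markup, remove accents, fold case, and split into tokens. The regular expressions are deliberately simplistic assumptions; a real parser handles far more formats and token types.

```python
import re
import unicodedata


def unaccent(text):
    """Drop combining marks so that 'josé' and 'jose' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Strip markup, normalize, and split the text into word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)   # parse/strip: remove HTML-like tags
    text = unaccent(text).casefold()       # normalize: accents and case
    return re.findall(r"[a-z0-9]+", text)  # tokenize: runs of letters/digits


print(tokenize("<p>José runs PostgreSQL</p>"))
```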
![Page 19: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/19.jpg)
token analysis/filters
* find synonyms
* expand thesaurus
* stem (maybe in different languages)
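The three filters can be sketched like this. The synonym and thesaurus tables reuse the examples from the earlier slides; the suffix-stripping `stem` function is a toy written for this example and is much cruder than a real stemmer such as Snowball.

```python
# Hypothetical dictionaries, based on the examples mentioned in the talk.
SYNONYMS = {"postgres": "postgresql", "pgsql": "postgresql"}
THESAURUS = {"os": ["operating", "system"]}

# Toy suffix rules, checked in order; real stemmers are far more careful.
SUFFIXES = ("ning", "ner", "ing", "s")


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyse(tokens):
    """Apply synonyms, then thesaurus expansion, then stemming."""
    result = []
    for token in tokens:
        token = SYNONYMS.get(token, token)
        expanded = THESAURUS.get(token, [token])
        result.extend(stem(t) for t in expanded)
    return result


print(analyse(["run", "runner", "running"]))
print(analyse(["pgsql", "os"]))
```

After analysis, different surface forms collapse to the same token, so a search for one form can match documents that contain another.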
![Page 20: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/20.jpg)
more token analysis/filters
* eliminate stopwords
* store word distance/frequency
* store the full contents of some fields
* store some fields as attributes/facets
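As an illustration of the first two bullets, the sketch below removes stopwords and stores the positions of the remaining tokens, from which frequency and word distance can be derived. The stopword list is a made-up sample, and keeping the original positions of the surviving tokens is an assumption of this sketch.

```python
# Hypothetical, tiny stopword list for the example.
STOPWORDS = {"i", "am", "a", "the", "with", "of"}


def to_vector(tokens):
    """Map each non-stopword token to the list of positions where it occurs."""
    vector = {}
    for position, token in enumerate(tokens, start=1):
        if token in STOPWORDS:
            continue  # dropped from the index, but the position is still counted
        vector.setdefault(token, []).append(position)
    return vector


vector = to_vector("the search engine indexes the search terms".split())
print(vector)

# Frequency is the number of stored positions for a token.
print(len(vector["search"]))
```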
![Page 21: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/21.jpg)
“the index file” is really
* a token file, probably segmented/distributed
* some dictionary files: synonyms, thesaurus,
stopwords, stems/lexemes (in different languages)
* word distance/frequency info
* attributes/original field files
* optional geospatial index
* auxiliary files: word/sentence boundaries, meta-info,
parser definitions, datasource definitions...
![Page 22: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/22.jpg)
the hardest
part is now
over
![Page 23: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/23.jpg)
searching in depth
* tokenize/analyse
* prepare operators
* retrieve information
* rank the results
* highlight the matched parts
![Page 24: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/24.jpg)
searching in depth: tokenize
normalize, tokenize, and analyse
the original search term
the result would be a tokenized, stemmed,
“synonymised” term, without stopwords
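The key point is that the query goes through the same analysis as the documents, otherwise the terms would not line up. A minimal sketch, with stemming left out for brevity and with made-up stopword and synonym tables:

```python
# Hypothetical dictionaries shared by indexing and searching.
STOPWORDS = {"the", "a", "of"}
SYNONYMS = {"postgres": "postgresql", "pgsql": "postgresql"}


def analyse(text):
    """Lowercase, split, drop stopwords, and map synonyms to one term."""
    terms = []
    for token in text.lower().split():
        if token in STOPWORDS:
            continue
        terms.append(SYNONYMS.get(token, token))
    return terms


# The same function is applied at indexing time and at search time.
document_terms = set(analyse("The history of PostgreSQL"))
query_terms = analyse("the pgsql")

matches = all(term in document_terms for term in query_terms)
print(query_terms, matches)
```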
![Page 25: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/25.jpg)
searching in depth: operators
* partial search
* logical/geospatial/range operators
* in-sentence/in-paragraph/word distance
* faceting/grouping
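Partial search and the logical operators map naturally onto set operations over an inverted index. The sketch below assumes a tiny hand-written index and uses a trailing `*` as a hypothetical prefix marker; it scans all tokens for prefixes, which a real engine would avoid by keeping tokens sorted.

```python
# Hypothetical inverted index: token -> ids of documents containing it.
INDEX = {
    "postgresql": {1, 3},
    "postgres": {2},
    "mysql": {4},
    "search": {1, 2, 4},
}


def lookup(term):
    """Return matching doc ids; a trailing '*' means prefix (partial) search."""
    if term.endswith("*"):
        prefix = term[:-1]
        hits = set()
        for token, ids in INDEX.items():
            if token.startswith(prefix):
                hits |= ids
        return hits
    return set(INDEX.get(term, set()))


def search_and(*terms):
    """Documents that contain every term."""
    results = [lookup(term) for term in terms]
    return set.intersection(*results) if results else set()


def search_or(*terms):
    """Documents that contain at least one term."""
    return set().union(*(lookup(term) for term in terms))


print(search_or("postgres", "mysql"))
print(search_and("postgres*", "search"))
```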
![Page 26: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/26.jpg)
searching in depth: retrieval
Go through the token index files, use the
attributes and geospatial files if necessary
for operators and/or grouping
You might need to do this in a distributed way
![Page 27: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/27.jpg)
searching in depth: ranking
algorithm to sort the most relevant results:
* field weights
* word frequency/density
* geospatial or timestamp ranking
* ad-hoc ranking strategies
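A very small ranking sketch combining the first two ideas, field weights and word density. The weights, field names, and sample documents are illustrative assumptions, and the formula is intentionally naive compared to what real rankers do.

```python
# Hypothetical weights: a match in the name counts more than in the description.
FIELD_WEIGHTS = {"name": 1.0, "description": 0.4}


def score(doc, query_terms):
    """Sum, over fields, of weight * (matching tokens / total tokens)."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = doc.get(field, "").lower().split()
        if not tokens:
            continue
        matches = sum(tokens.count(term) for term in query_terms)
        total += weight * matches / len(tokens)
    return total


def rank(docs, query_terms):
    """Return ids of matching documents, most relevant first."""
    scored = [(score(doc, query_terms), doc["id"]) for doc in docs]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for value, doc_id in ordered if value > 0]


docs = [
    {"id": 1, "name": "postgres search", "description": "a talk about indexes"},
    {"id": 2, "name": "ruby tips", "description": "use postgres with rails"},
    {"id": 3, "name": "cooking", "description": "pasta recipes"},
]
print(rank(docs, ["postgres"]))
```

Document 1 ranks above document 2 because its match is in the higher-weighted field and in a shorter text; document 3 does not match and is left out.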
![Page 28: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/28.jpg)
searching in depth: highlighting
Mark the matching parts of the results
It can be tricky/slow if you are not storing the full contents
in your indexes
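A naive highlighter, to show why the original text is needed: the markers have to be inserted into the stored contents, not into the tokens. This sketch matches the raw query words only (no stemming), and the `<b>` markers are just a convenient default chosen for the example.

```python
import re


def highlight(text, terms, start="<b>", stop="</b>"):
    """Wrap whole-word, case-insensitive matches of the terms in markers."""
    if not terms:
        return text
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{start}{m.group(0)}{stop}", text)


print(highlight("Play the game of PostgreSQL", ["game", "play"]))
```

If only the analysed tokens were stored, the engine would have to fetch and re-analyse the source text to find where each match starts and ends, which is the slow path the slide warns about.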
![Page 29: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/29.jpg)
PostgreSQL as a
full-text
search engine
![Page 30: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/30.jpg)
search features
* index format configuration
* partial search
* word boundaries parser (not configurable)
* stemmers/synonyms/thesaurus/stopwords
* full-text logical operators
* attributes/geo/timestamp/range (using SQL)
* ranking strategies
* highlighting
* debugging/testing commands
![Page 31: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/31.jpg)
indexing in postgresql
you don't actually need an index to use full-text search in PostgreSQL
but unless your db is very small, you want to have one
Choose GiST or GIN (GIN: faster search, but slower indexing
and larger index size)
CREATE INDEX pgweb_idx ON pgweb USING
gin(to_tsvector(config_name, body));
![Page 32: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/32.jpg)
Two new things
CREATE INDEX ... USING gin(to_tsvector (config_name, body));
* to_tsvector: postgresql way of saying “tokenize”
* config_name: tokenizing/analysis rule set
![Page 33: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/33.jpg)
Configuration
CREATE TEXT SEARCH CONFIGURATION
public.teowaki ( COPY = pg_catalog.english );
![Page 34: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/34.jpg)
Configuration
CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
DictFile = en_us,
AffFile = en_us,
StopWords = spanglish
);
CREATE TEXT SEARCH DICTIONARY spanish_ispell (
TEMPLATE = ispell,
DictFile = es_any,
AffFile = es_any,
StopWords = spanish
);
![Page 35: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/35.jpg)
Configuration
CREATE TEXT SEARCH DICTIONARY english_stem (
TEMPLATE = snowball,
Language = english,
StopWords = english
);
CREATE TEXT SEARCH DICTIONARY spanish_stem (
TEMPLATE = snowball,
Language = spanish,
StopWords = spanish
);
![Page 36: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/36.jpg)
Configuration
Parser.
Word boundaries
![Page 37: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/37.jpg)
Configuration
Assign dictionaries (in specific to generic order)
ALTER TEXT SEARCH CONFIGURATION teowaki
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword,
hword_part
WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem;
ALTER TEXT SEARCH CONFIGURATION teowaki
DROP MAPPING FOR email, url, url_path, sfloat, float;
![Page 38: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/38.jpg)
debugging
select * from ts_debug('teowaki', 'I am searching unas
búsquedas con postgresql database');
also ts_lexize and ts_parse
![Page 39: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/39.jpg)
tokenizing
tokens + position (stopwords are removed, tokens are folded)
![Page 40: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/40.jpg)
searching
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres');
![Page 41: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/41.jpg)
searching
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres:*');
![Page 42: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/42.jpg)
operators
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres | mysql');
![Page 43: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/43.jpg)
ranking weights
SELECT setweight(to_tsvector(coalesce(name,'')),'A') ||
setweight(to_tsvector(coalesce(description,'')),'B')
from wakis limit 1;
![Page 44: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/44.jpg)
search by weight
![Page 45: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/45.jpg)
ranking
SELECT name, ts_rank(to_tsvector(name), query) rank
from wakis, to_tsquery('postgres | indexes') query
where to_tsvector(name) @@ query order by rank DESC;
also ts_rank_cd
![Page 46: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/46.jpg)
highlighting
SELECT ts_headline(name, query) from wakis,
to_tsquery('teowaki', 'game|play') query
where to_tsvector('teowaki', name) @@ query;
![Page 47: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/47.jpg)
USE POSTGRESQL
FOR EVERYTHING
![Page 48: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/48.jpg)
When PostgreSQL is not good
* You need to index files (PDF, Odx...)
* Your index is very big (slow reindex)
* You need a distributed index
* You need complex tokenizers
* You need advanced rankers
![Page 49: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/49.jpg)
When PostgreSQL is not good
* You want a REST API
* You want sentence/proximity/range/
more complex operators
* You want search auto completion
* You want advanced features (alerts...)
![Page 50: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/50.jpg)
But it has been
perfect for us so far.
Our users don't care
which search engine
we use, as long as
it works.
![Page 51: Postgresql search demystified](https://reader034.vdocuments.mx/reader034/viewer/2022042515/547e7b70b4af9fbe158b57bd/html5/thumbnails/51.jpg)