PGroonga - Make PostgreSQL fast full text search platform for all languages!
TRANSCRIPT
-
PGroonga - Make PostgreSQL fast full text search platform for all languages! Powered by Rabbit 2.2.0
PGroonga
Make PostgreSQL fast full text search platform for all languages!
Kouhei Sutou, ClearCode Inc.
PGConf.ASIA 2016
2016-12-03
-
PostgreSQL and me
Some of my patches have been merged
-
Patches
#13840: pg_dump generates unloadable SQL
#14160: DROP ACCESS METHOD IF EXISTS isn't implemented
They were found while developing PGroonga
-
PGroonga development style
When there are problems in related projects, including PostgreSQL, we fix them in those projects instead of working around them in PGroonga
-
PostgreSQL and FTS
PostgreSQL has a built-in full text search (FTS) feature
It has some problems...
We fixed them with PGroonga instead of fixing PostgreSQL itself
-
Because...
1. Our approach is different from PostgreSQL's approach
2. PostgreSQL provides a plugin system, so implementing as a plugin is the PostgreSQL way!
-
PostgreSQL FTS problem
Many languages aren't supported
e.g.: Asian languages (Japanese, Chinese and more)
-
FTS for Japanese (1)
SELECT to_tsvector('japanese', '');
-- ERROR:  text search configuration
--         "japanese" does not exist
-- LINE 2: to_tsvector('japanese',
--         ^
-
FTS for Japanese (2)
CREATE EXTENSION pg_trgm;
SELECT show_trgm('');
--  show_trgm
-- -----------
--  {}          <- Must not be empty!
-- (1 row)
-
Existing solution
pg_bigm
-
pg_bigm
An extension
Similar to pg_trgm
Operator class for GIN
-
pg_bigm: Usage
CREATE INDEX index ON table
  USING GIN               -- Use GIN
  (column gin_bigm_ops);  -- Specify op class
-
pg_bigm: Drawback
Slow for large documents (normally, we want to use FTS for large documents)
Because it needs "recheck"
-
"recheck"
"Exact" sequential search after "loose" index search
The larger the text, the slower it is
Total text = document size * number of documents
-
Benchmark
[Chart: elapsed time in seconds (lower is better) against number of hits (311, 14706, 20389) for pg_bigm; two results are marked "Slow"]
Data: Japanese Wikipedia (many records and large documents)
N records: about 0.9 million
Average text size: 6.7 KiB
-
New solution
-
PGroonga
Pronunciation: píːzí:lúnɡά
An extension
Provides its own index and operator classes
They are not operator classes for GIN
-
PGroonga layer
Index | Operator class
GIN | textsearch, pg_trgm, pg_bigm
PGroonga | PGroonga
-
Benchmark
[Chart: elapsed time in seconds (lower is better) against number of hits (311, 14706, 20389) for PGroonga and pg_bigm; two results are marked "Fast"]
Data: Japanese Wikipedia (many records and large documents)
N records: about 0.9 million
Average text size: 6.7 KiB
-
Wrap up (1)
PostgreSQL's built-in FTS doesn't support Asian languages
pg_bigm and PGroonga support all languages
-
Wrap up (2)
Many hits case:
pg_bigm is slow
PGroonga is fast
-
Why is PGroonga fast?
It doesn't need "recheck"
Is "recheck" really slow?
See one more benchmark result
-
Benchmark
[Chart: elapsed time in seconds (lower is better) against number of hits (0 to 500000) for PGroonga and pg_bigm; annotations: "Slow", "Slow", "Fast for many hits!"]
Query: ""
Data: Japanese Wikipedia (many records and large documents)
N records: about 0.9 million
Average text size: 6.7 KiB
-
Why is pg_bigm fast?
Query is ""
Point: 2 characters
pg_bigm doesn't need "recheck" for a 2-character query
It means that "recheck" is slow
-
N-gram and "recheck"
The N-gram approach needs "phrase search" when the query has N+1 or more characters
N=2 for pg_bigm, N=3 for pg_trgm
GIN needs "recheck" for "phrase search"
-
Phrase search
Phrase search is "token search" and "position check"
Tokens must exist and be ordered
OK: "car at" for "car at" query
NG: "at car" for "car at" query
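The two steps above can be sketched in Python. This is only an illustration: it assumes whitespace-separated word tokens, and the function names are hypothetical, not taken from any library.

```python
def token_search(text_tokens, query_tokens):
    """Step 1: every query token must exist somewhere in the text."""
    return all(token in text_tokens for token in query_tokens)


def position_check(text_tokens, query_tokens):
    """Step 2: the query tokens must appear consecutively and in order."""
    n = len(query_tokens)
    return any(
        text_tokens[i:i + n] == query_tokens
        for i in range(len(text_tokens) - n + 1)
    )


def phrase_search(text, query):
    text_tokens = text.split()
    query_tokens = query.split()
    return token_search(text_tokens, query_tokens) and position_check(
        text_tokens, query_tokens
    )


print(phrase_search("car at", "car at"))  # True: the OK case
print(phrase_search("at car", "car at"))  # False: tokens exist but the order differs
```

The NG case passes the token search and fails only at the position check, which is why the second step cannot be skipped.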
-
N-gram and phrase search
1. Split text to tokens
"cat" -> "ca","at"
2. Search all tokens
"ca" & "at" exist: Candidate!
3. Check appearance pos.
"ca" then "at": Found!
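As a rough Python sketch of the tokenize step, and of why a query of N+1 or more characters becomes a phrase search: `ngrams` and `needs_phrase_search` are hypothetical helpers, not pg_bigm or pg_trgm functions, and the sketch assumes no padding or normalization, which real implementations may apply.

```python
def ngrams(text, n=2):
    """Split text into overlapping n-character tokens (no padding)."""
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def needs_phrase_search(query, n=2):
    """More than one token means token order matters, so positions must be checked."""
    return len(ngrams(query, n)) > 1


print(ngrams("cat"))               # ['ca', 'at']
print(needs_phrase_search("ca"))   # False: one token, a plain index lookup is enough
print(needs_phrase_search("cat"))  # True: N+1 characters with N=2
```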
-
N-gram and GIN: Create
Documents:
ID | Text
10 | cat
20 | at car
Tokenize:
"cat" -> "ca","at"
"at car" -> "at","t "," c","ca","ar"
GIN:
Token | Posting list
"ca" | 10,20
"at" | 10,20
"t " | 20
" c" | 20
"ar" | 20
-
N-gram and GIN: Search
Query: cat
Tokenize: "ca","at"
Search GIN with AND:
Token | Posting list
"ca" | 10,20
"at" | 10,20
"t " | 20
" c" | 20
"ar" | 20
Candidates: 10,20
Documents:
ID | Text
10 | cat
20 | at car
Appearance position check (Point: Out of GIN)
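The create and search flow for a doc-ID-only index can be sketched in Python as follows. A plain dictionary stands in for GIN here, and a substring test stands in for the "recheck"; neither is how PostgreSQL actually implements them, and all names are illustrative.

```python
from collections import defaultdict

DOCUMENTS = {10: "cat", 20: "at car"}


def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(documents):
    """Map each token to the sorted IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in bigrams(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search(index, documents, query):
    # Loose index search: AND of the posting lists gives candidates only.
    posting_lists = [set(index.get(token, [])) for token in bigrams(query)]
    candidates = sorted(set.intersection(*posting_lists)) if posting_lists else []
    # "recheck": the index has no positions, so each candidate text is scanned.
    hits = [doc_id for doc_id in candidates if query in documents[doc_id]]
    return candidates, hits


index = build_index(DOCUMENTS)
print(index["ca"], index["at"])         # [10, 20] [10, 20]
print(search(index, DOCUMENTS, "cat"))  # ([10, 20], [10])
```

Document 20 survives the index search and is rejected only after its text is read, so the cost of the second step grows with the number and size of candidate documents.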
-
GIN and phrase search
Phrase search needs position check
GIN doesn't support position check
GIN needs "recheck": Slow!
-
Why is PGroonga fast?
PGroonga uses N-gram by default
But it doesn't need "recheck"
-
Why no "recheck"?
PGroonga uses a full inverted index
-
Full inverted index
Including position
-
Inverted index diff
Documents:
ID | Text
10 | cat
20 | at car
Tokenize (token positions):
"cat" -> "ca" (1), "at" (2)
"at car" -> "at" (1), "t " (2), " c" (3), "ca" (4), "ar" (5)
Full: Doc ID + pos
Token | Posting list
"ca" | 10:1,20:4
"at" | 10:2,20:1
"t " | 20:2
" c" | 20:3
"ar" | 20:5
Not full: Only doc ID
Token | Posting list
"ca" | 10,20
"at" | 10,20
"t " | 20
" c" | 20
"ar" | 20
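A minimal Python sketch of building the full posting lists from the two sample documents is shown below. It only illustrates the data structure (document ID plus position); it is not how Groonga stores its index, and the function names are made up for this example.

```python
from collections import defaultdict

DOCUMENTS = {10: "cat", 20: "at car"}


def bigrams_with_positions(text):
    """Yield (token, position) pairs; positions start at 1 as in the slide."""
    for i in range(len(text) - 1):
        yield text[i:i + 2], i + 1


def build_full_index(documents):
    """Map each token to a list of (document ID, position) postings."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for token, position in bigrams_with_positions(documents[doc_id]):
            index[token].append((doc_id, position))
    return dict(index)


full_index = build_full_index(DOCUMENTS)
print(full_index["ca"])  # [(10, 1), (20, 4)]
print(full_index["at"])  # [(10, 2), (20, 1)]
print(full_index["t "])  # [(20, 2)]
```

Dropping the second element of every posting gives the "not full" form.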
-
N-gram/PGroonga: Search
Query: cat
Tokenize: "ca","at"
Search PGroonga with AND:
Token | Posting list
"ca" | 10:1,20:4
"at" | 10:2,20:1
"t " | 20:2
" c" | 20:3
"ar" | 20:5
Appearance position check (Point: In PGroonga)
Result: 10
Documents:
ID | Text
10 | cat
20 | at car
-
Wrap up
N-gram needs phrase search
A full inverted index provides fast phrase search
GIN isn't a full inverted index
PGroonga uses a full inverted index
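To make the point concrete, here is a Python sketch of a phrase search answered from position-bearing postings alone, with no access to the document texts at query time. It is an illustration under simplified assumptions (plain bigrams, in-memory dictionaries), not Groonga's actual algorithm.

```python
DOCUMENTS = {10: "cat", 20: "at car"}


def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_full_index(documents):
    """token -> {document ID -> set of 1-based positions}."""
    index = {}
    for doc_id, text in documents.items():
        for position, token in enumerate(bigrams(text), start=1):
            index.setdefault(token, {}).setdefault(doc_id, set()).add(position)
    return index


def phrase_search(index, query):
    """Answer the query from postings alone; document texts are never read."""
    tokens = bigrams(query)
    if not tokens:
        return []
    hits = []
    for doc_id, starts in index.get(tokens[0], {}).items():
        # A hit needs token k at position start + k for every k.
        if any(
            all(
                start + k in index.get(token, {}).get(doc_id, set())
                for k, token in enumerate(tokens)
            )
            for start in starts
        ):
            hits.append(doc_id)
    return sorted(hits)


index = build_full_index(DOCUMENTS)
print(phrase_search(index, "cat"))  # [10]
```

Document 20 is rejected inside the index lookup because "at" does not follow "ca" there, so no separate recheck pass is needed.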
-
FTS and English (*)
Normally, N-gram isn't used for English FTS
N-gram is slower than the word based approach (the textsearch approach)
Stemming and stop words can't be used with N-gram
(*) English = alphabet based languages
-
PGroonga and English
PGroonga uses N-gram by default
Is PGroonga slow for English?
No. It is similar to textsearch
-
PGroonga: Search
[Chart: elapsed time in ms (shorter is better) per query ("PostgreSQL OR MySQL", "database", "America") for PGroonga and textsearch]
Data: English Wikipedia (many records and large docs)
N records: about 5.3 million
Average text size: 6.4 KiB
-
PGroonga's N-gram
Variable size N-gram
Continuous alphabets are 1 token (= word based approach)
Hello -> "Hello", not "He","el",...
Non-alphabet text is 2-gram
"","",...
-
Wrap up (1)
PGroonga's search is fast for all languages
This includes text that mixes alphabet based languages and Asian languages (textsearch doesn't support the mixed case)
-
Wrap up (2)
PGroonga makes PostgreSQL a fast full text search platform for all languages!
-
More about PGroonga
Performance
Japanese specific feature
JSON support
Replication
-
Performance
Search and update
Index only scan
Direct Groonga search
Index creation
-
Search and update
Search performance doesn't decrease while updating
It's a good characteristic for chat applications
Zulip supports PGroonga (Zulip: OSS chat app by Dropbox)
-
Characteristics
[Two charts: search throughput against update throughput, one for PGroonga and one for GIN]
PGroonga: keeps search performance during many updates
GIN: search performance decreases while updating
-
Update and lock
Update without read locks
Write locks are required
-
GIN: Read/Write
[Timeline diagram for two connections using GIN]
Conn1: INSERT start
Conn2: SELECT start, then blocked
Conn1: INSERT finish
Conn2: SELECT finish
Slow down!
-
PGroonga: Read/Write
[Timeline diagram for two connections using PGroonga]
Conn1: INSERT start
Conn2: SELECT start
Conn1: INSERT finish
Conn2: SELECT finish
No slow down!
-
Stably fast
GIN has intermittent performance drops
For details, see "GIN pending list"
PGroonga keeps search fast
PGroonga keeps the index updated
-
Index only scan
GIN: Not supported
PGroonga: Supported
-
Even faster search
Direct Groonga search is even faster
Groonga: the full text search engine that PGroonga uses
-
Direct Groonga search
[Chart: elapsed time in ms (shorter is better) per query ("PostgreSQL OR MySQL", "database", "America") for PGroonga, Groonga and textsearch]
Data: English Wikipedia (many records and large docs)
N records: about 5.3 million
Average text size: 6.4 KiB
Groonga is 30x faster than the others
-
Index creation time
[Chart: index creation elapsed time in hours (shorter is better) per module: PGroonga, textsearch, pg_trgm]
Data: English Wikipedia
Size: about 33 GiB
Max text size: 1 MiB
2x faster than textsearch
-
Performance: Wrap up
Keeps search fast during updates
Supports index only scan
Direct Groonga search is even faster
Fast index creation
-
Japanese specific feature
Completion by Romaji
-
Completion: Table
CREATE TABLE stations (
  name text,
  readings text[] -- Support N readings
);
-
Completion: Data
INSERT INTO stations VALUES
  ('Tokyo', ARRAY['']), -- In Katakana
  -- (...),
  ('Akihabara', ARRAY['', '']);
-
Completion: Index
CREATE INDEX pgroonga_index ON stations
  USING pgroonga (
    -- For prefix and prefix Romaji/Katakana search
    name pgroonga.text_term_search_ops_v2,
    -- For prefix and prefix Romaji/Katakana search
    -- against array
    readings pgroonga.text_array_term_search_ops_v2);
-
Completion: Search
SELECT name, readings
  FROM stations
 WHERE name &^ 'tou' OR      -- Prefix search
       readings &^~> 'tou'   -- Prefix Romaji/Katakana search
 ORDER BY name LIMIT 10;
-
Completion: Result
 name  |   readings
-------+--------------
 Tokyo | {}
(1 row)
Hit by prefix Romaji/Katakana search: "tou" (Romaji) matches "" (Katakana)
-
For Japanese: Wrap up
Supports prefix Romaji/Kana search
Useful for implementing an auto complete feature in a search box
Users don't need to convert Romaji to Kanji
-
JSON support
Support full text search
Target: All texts in JSON
Not only a text in a path (GIN supports only this style)
-
JSON: FTS: Data
CREATE TABLE logs (
  record jsonb
);
INSERT INTO logs (record)
  VALUES ('{"host": "app1"}'),
         ('{"message": "app is down"}');
-
JSON: FTS: Index
CREATE INDEX message_index ON logs
  USING GIN ((record->>'message') gin_trgm_ops);
-- {"message": "HERE IS ONLY SEARCHABLE"}
CREATE INDEX record_index ON logs
  USING pgroonga (record);
-- All string values are searchable
-
JSON: FTS: GIN
SELECT * FROM logs
 WHERE record->>'message' LIKE '%app%';
-- {"host": "app1"} isn't target
--            record
-- ----------------------------
--  {"message": "app is down"}
-- (1 row)
-
JSON: FTS: PGroonga
SELECT * FROM logs
 WHERE record @@ 'string @ "app"';
-- All string values are target
--            record
-- ----------------------------
--  {"host": "app1"}
--  {"message": "app is down"}
-- (2 rows)
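The difference between the two styles can be illustrated outside the database with a short Python sketch. It uses a plain substring test as a stand-in for full text matching, and the helper names are hypothetical; it shows only which values are considered targets, not how either index works.

```python
import json

RECORDS = ['{"host": "app1"}', '{"message": "app is down"}']


def string_values(value):
    """Yield every string value nested anywhere in a parsed JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from string_values(child)
    elif isinstance(value, list):
        for child in value:
            yield from string_values(child)


def match_path(record, key, word):
    """Single-path style: only the value under one key is a target."""
    value = json.loads(record).get(key)
    return isinstance(value, str) and word in value


def match_any_string(record, word):
    """All-strings style: every string value in the record is a target."""
    return any(word in text for text in string_values(json.loads(record)))


print([r for r in RECORDS if match_path(r, "message", "app")])  # 1 record
print([r for r in RECORDS if match_any_string(r, "app")])       # 2 records
```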
-
JSON: Wrap up
Support full text search against all texts in JSON
-
Replication
Supported with PostgreSQL 9.6!
PostgreSQL 9.6 ships "generic WAL"
Third party indexes can support WAL generation
-
Implementation
1. Master: Encode action logs as MessagePack
2. Master: Write the action logs to WAL
3. Slaves: Read the action logs and apply them
-
Overview
[Diagram: master PostgreSQL, WAL and slave]
Master: on INSERT, PGroonga updates the PGroonga DB and appends action logs to the index file via the generic WAL API
WAL: carries the action logs to the slave
Slave: PGroonga applies pending action logs on SELECT
-
Action log: "action"
{
  "_action": ACTION_ID
}
# ACTION_ID: 0: INSERT
# ACTION_ID: 1: CREATE_TABLE
# ACTION_ID: 2: CREATE_COLUMN
# ACTION_ID: 3: SET_SOURCES
-
Action log: INSERT
{
  "_action": 0,
  "_table": "TABLE_NAME",
  "ctid": PACKED_CTID_VALUE,
  "column1": COLUMN1_VALUE,
  ...
}
-
Action log: Logs
{"_action": ACTION_ID, ...}
{"_action": ACTION_ID, ...}
{"_action": ACTION_ID, ...}
...
-
Write action logs
[Diagram: each page of the index file holds a header followed by action logs]
-
Apply action logs
[Diagram: action logs in index file blocks 1 to 4 are applied to the PGroonga DB]
Applied offset (Block# + Offset): (2,10) before applying, (2,50) after applying
-
Action log: Why msgpack?
Because MessagePack supports streaming unpack
It's useful for stopping the application of action logs when WAL is only partially applied on slaves
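The stop-and-resume behavior can be sketched in Python with the standard library only. Important assumptions: length-prefixed JSON stands in for MessagePack's streaming unpacker, a single integer offset stands in for the (block number, offset) pair, and the function names are hypothetical. This is not PGroonga's on-disk format.

```python
import json
import struct


def encode(action):
    """Encode one action log as a 4-byte length prefix plus a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return struct.pack(">I", len(body)) + body


def apply_pending(buffer, applied_offset, apply):
    """Apply every complete record after applied_offset; return the new offset."""
    offset = applied_offset
    while offset + 4 <= len(buffer):
        (size,) = struct.unpack_from(">I", buffer, offset)
        end = offset + 4 + size
        if end > len(buffer):
            break  # record is only partially available: stop and retry later
        apply(json.loads(buffer[offset + 4:end].decode("utf-8")))
        offset = end
    return offset


logs = encode({"_action": 1, "_table": "t"}) + encode({"_action": 0, "_table": "t"})
applied = []
partial = logs[:-3]                        # the last record is truncated
offset = apply_pending(partial, 0, applied.append)
print(len(applied))                        # 1
offset = apply_pending(logs, offset, applied.append)
print(len(applied), offset == len(logs))   # 2 True
```

Because the applied offset only advances past complete records, a later call can resume exactly where the previous one stopped.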
-
Replication: Wrap up
Supported with PostgreSQL 9.6!
Concept: Action logs on WAL
It'll be a useful pattern for indexes stored outside PostgreSQL storage
-
Wrap up (1)
PostgreSQL doesn't support FTS for all languages
PGroonga supports FTS for all languages
-
Wrap up (2)
PGroonga is stably fast
PGroonga supports FTS for all texts in JSON
-
Wrap up (3)
PGroonga supports replication
PostgreSQL 9.6 is required
-
Wrap up (4)
PGroonga makes PostgreSQL a fast full text search platform for all languages!
-
See also
https://pgroonga.github.io/
Tutorial: https://pgroonga.github.io/tutorial/
Install: https://pgroonga.github.io/install/
Reference: https://pgroonga.github.io/reference/ (includes replication doc and benchmark docs)
Community: https://pgroonga.github.io/community/