Text search with Elasticsearch on AWS
Łukasz Przybyłek, Tidio
What’s Elasticsearch?
● Search & analytics engine
● Fast
● Scalable
● Distributed
● Full text search capabilities
● (near) Real time indexing
● Document oriented
● Schema free
When do I need it?
● You need a faster search mechanism
● You need to search a large amount of data
● You need powerful full-text queries
How does it work?
Input document → Analyzer → Terms → Index
Inverted Index
Id Content
1 The quick brown fox jumped over the lazy dog
2 Quick brown foxes leap over lazy dogs in summer
After analysis (lowercasing, stemming, and mapping "leap" to its synonym "jump"):
Term     Doc_1   Doc_2
brown    X       X
dog      X       X
fox      X       X
in               X
jump     X       X
lazy     X       X
over     X       X
quick    X       X
summer           X
the      X
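The analysis and indexing steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene implements it: the `NORMALIZE` table is a hand-written, hypothetical stand-in for real stemming and synonym filters, and the function names are made up for this example.

```python
import re
from collections import defaultdict

# Toy replacement for stemming and synonym filters (hypothetical table).
NORMALIZE = {"jumped": "jump", "leap": "jump", "foxes": "fox", "dogs": "dog"}


def analyze(text):
    """Lowercase, split into word tokens, then normalize each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [NORMALIZE.get(token, token) for token in tokens]


def build_inverted_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in analyze(content):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)

print(sorted(search(index, "quick foxes")))   # [1, 2]
print(sorted(search(index, "summer foxes")))  # [2]
```

The query goes through the same analyzer as the documents, which is why "foxes" matches a document that only contains "fox".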
Logical data structures
● Elasticsearch (cluster) contains indexes
● Index contains types
● Type contains documents
● Mappings are assigned to types
● Index aliases (optional) can point to indices and modify queries (e.g. add a filter)
● There are no classic SQL-like relationships (!)
[Diagram: a cluster contains indexes; an index contains types; each type holds documents and has a mapping.]
Physical data structures
● Cluster contains nodes
● Index is stored in one or more shards (a single shard is a Lucene index instance)
● Single node contains shards of different indexes
How to deal with lack of joins?
● Denormalization
● Client-side joins
● Parent-child relationships
Elasticsearch in Tidio
● Tidio Chat - business communication tool where business owners (operators) communicate with their customers (visitors)
● www.tidiochat.com
● ES is used instead of MariaDB to:
  ○ Fetch the most recent conversations in a project
  ○ Search by message content and visitor email in a project’s conversation history
Relations in Tidio Chat
Message: id, visitor_id, operator_id, content, time
Project: public_key
Visitor: id, project_public_key, name
Operator: id, project_public_key
Message document schema
● Project’s public key added to document
● Search by email performed in MariaDB
● Time mapped as date explicitly
● Client-side join with Visitor
Message: id, visitor_id, operator_id, project_public_key, content, time
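A client-side join with Visitor might look roughly like the sketch below. It assumes the message documents have already come back from Elasticsearch as plain dictionaries; `join_visitors`, `fetch_from_table`, and the in-memory `VISITORS` table are hypothetical names used only for illustration, with the table standing in for a lookup in the relational database.

```python
from typing import Callable, Dict, Iterable, List


def join_visitors(
    hits: Iterable[dict],
    fetch_visitors: Callable[[List[str]], Dict[str, dict]],
) -> List[dict]:
    """Attach the matching visitor record to each message document."""
    hits = list(hits)
    # Collect each visitor id once, keeping first-seen order.
    visitor_ids = list(dict.fromkeys(hit["visitor_id"] for hit in hits))
    # One batched lookup instead of one query per message.
    visitors = fetch_visitors(visitor_ids)
    return [{**hit, "visitor": visitors.get(hit["visitor_id"])} for hit in hits]


# Hypothetical stand-in for the relational Visitor table.
VISITORS = {
    "v1": {"id": "v1", "project_public_key": "p1", "name": "Alice"},
    "v2": {"id": "v2", "project_public_key": "p1", "name": "Bob"},
}


def fetch_from_table(ids: List[str]) -> Dict[str, dict]:
    return {i: VISITORS[i] for i in ids if i in VISITORS}


messages = [
    {"id": "m1", "visitor_id": "v1", "project_public_key": "p1", "content": "Hi"},
    {"id": "m2", "visitor_id": "v2", "project_public_key": "p1", "content": "Hello"},
    {"id": "m3", "visitor_id": "v1", "project_public_key": "p1", "content": "Thanks"},
]
joined = join_visitors(messages, fetch_from_table)
```

Batching the visitor ids keeps the join at two round trips (one search, one lookup) regardless of how many messages are returned.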
Design decisions
● Questions
  a. What indexes should be created?
  b. What types should be created?
  c. How should shards be distributed among nodes and indexes?
● Things to consider
  a. Searching a smaller dataset usually means faster results
  b. An index with a small number of shards does not scale efficiently to new nodes
  c. Types are used mainly to assign mappings; they are not separate “search entities”, so there is no direct performance boost from using many types
  d. An index doesn’t need to represent a domain entity
Ideas?
Index for each project, one type inside index
● 250k projects = 250k indexes
● Adding a new index is slow
● Large overhead from the number of shards and indices
Ideas?
One index and separate type for each project
● Large index
● Scaling out to more nodes is capped by the number of shards in the index (default 5, no automatic index splitting)
● Every query would go through all shards and filter by project_public_key (a large amount of data to search)
Ideas?
Group projects and create an index for each group
● Limited amount of data to search
● Reasonable number of shards, which can still scale out to many nodes
● Possibility to add an alias for each project and search it as if it were a separate index
● Projects may be grouped by language and use language-specific analyzers
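The grouping-plus-alias idea can be sketched as follows. The grouping rule here (hashing the project key into a fixed number of groups) is an assumption made for the example; the slides only suggest grouping, for instance by language. `GROUP_COUNT`, the index name pattern, and the alias name pattern are likewise hypothetical. The returned dictionary has the shape of a request body for the Elasticsearch `_aliases` endpoint, which supports filtered aliases.

```python
import hashlib

GROUP_COUNT = 16  # hypothetical number of project groups


def group_index(project_public_key: str, groups: int = GROUP_COUNT) -> str:
    """Pick the group index for a project; stable across processes."""
    digest = hashlib.md5(project_public_key.encode("utf-8")).hexdigest()
    return f"messages_group_{int(digest, 16) % groups}"


def alias_actions(project_public_key: str) -> dict:
    """Build a body for POST /_aliases: one filtered alias per project."""
    return {
        "actions": [
            {
                "add": {
                    "index": group_index(project_public_key),
                    "alias": f"project_{project_public_key}",
                    # Assumes project_public_key is indexed as an exact
                    # (not analyzed) value, so a term filter matches it.
                    "filter": {
                        "term": {"project_public_key": project_public_key}
                    },
                }
            }
        ]
    }


body = alias_actions("abc123")
```

Searching through the alias then touches only the shards of one group index, and the filter hides documents that belong to other projects in the same group.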
Amazon Web Services Elasticsearch cluster
● Quick and easy to install
● Extremely limited configuration options
● Limited query options (scripts disabled)
● Can be used with standard AWS authentication
● There is no AWS SDK that supports ES, so users have to write code that signs requests manually
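Signing a request manually means computing an AWS Signature Version 4 and sending it in the Authorization header. The sketch below shows the main steps using only the Python standard library. It is a simplified illustration, not production code: it assumes a request with no query string, signs only the `host` and `x-amz-date` headers, and uses the service name `es`; the function name and parameters are made up for this example.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sign_request(method, host, path, body, region,
                 access_key, secret_key, now=None):
    """Return headers carrying a Signature Version 4 for one request."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    service = "es"

    # 1. Canonical request (empty query string in this sketch).
    signed_headers = "host;x-amz-date"
    canonical_headers = f"host:{host}\nx-amz-date:{amz_date}\n"
    canonical_request = "\n".join([
        method,
        quote(path, safe="/-_.~"),
        "",
        canonical_headers,
        signed_headers,
        hashlib.sha256(body).hexdigest(),
    ])

    # 2. String to sign.
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # 3. Derive the signing key and compute the signature.
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(
        key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()

    return {
        "Host": host,
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }
```

The returned headers would be attached to the HTTP request sent to the cluster endpoint. The body passed in must be the exact bytes that are sent, since its hash is part of the signature.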
PHP clients for ES
● elasticsearch/elasticsearch
  ○ https://github.com/elastic/elasticsearch-php
  ○ Low-level ES client
  ○ One-to-one mapping with the REST API
  ○ Pluggable architecture (can use a custom request handler and send AWS-signed requests)
  ○ Handles the things you don’t want to think about, e.g. discovery of cluster nodes, load balancing, Keep-Alive connections
  ○ Accepts queries in JSON
● ruflin/elastica
  ○ https://github.com/ruflin/Elastica
  ○ High-level client
  ○ Classes representing indices/queries/terms, so you do not have to write JSON by hand
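As an illustration of the kind of JSON query body a low-level client passes through to the REST API, the sketch below builds a search for message text within one project, newest first. It is written in Python rather than PHP purely for illustration. The field names come from the Message schema; the function name and default size are hypothetical, and the `filter` clause inside `bool` assumes Elasticsearch 2.0 or later (1.x used the `filtered` query instead).

```python
import json


def search_body(project_public_key: str, text: str, size: int = 20) -> str:
    """Build a JSON search body: full-text match scoped to one project."""
    query = {
        "query": {
            "bool": {
                # Scored full-text match on the analyzed content field.
                "must": {"match": {"content": text}},
                # Unscored exact-match restriction to a single project.
                "filter": {"term": {"project_public_key": project_public_key}},
            }
        },
        "sort": [{"time": {"order": "desc"}}],
        "size": size,
    }
    return json.dumps(query)


payload = search_body("abc123", "refund status")
```

A high-level client such as Elastica would build an equivalent structure from query objects instead of a hand-written dictionary.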
Elasticsearch limitations
● Less capable than SQL
● There is no paging support for aggregations
AWS Elasticsearch limitations
● threadpool.bulk.queue_size=50
● No script support
Indexing performance
● Check your mappings!
● Set fields that do not need full-text search as not analyzed
● Disable the _all field
● Tune your analyzer and index_options (advanced)
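A mapping that applies these points to the Message document might look like the sketch below. It uses the pre-5.0 mapping syntax (`string` fields with `"index": "not_analyzed"`, and an `_all` field that can be disabled per type), which is an assumption based on the features the slides mention; the type name `message` is also assumed. Later Elasticsearch versions replaced `string` with `text` and `keyword`.

```python
import json

# Hypothetical index-creation body for the Message document.
MESSAGE_MAPPING = {
    "mappings": {
        "message": {
            # Skip the catch-all field; searches target "content" directly.
            "_all": {"enabled": False},
            "properties": {
                # Identifiers are matched exactly, so they are not analyzed.
                "id": {"type": "string", "index": "not_analyzed"},
                "visitor_id": {"type": "string", "index": "not_analyzed"},
                "operator_id": {"type": "string", "index": "not_analyzed"},
                "project_public_key": {"type": "string", "index": "not_analyzed"},
                # The only field that needs full-text analysis.
                "content": {"type": "string"},
                # Mapped explicitly as a date rather than left to detection.
                "time": {"type": "date"},
            },
        }
    }
}

print(json.dumps(MESSAGE_MAPPING, indent=2))
```

Only `content` goes through an analyzer here, so the other fields cost less to index and can be used in exact-match filters.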
Search performance
● Unfair comparison
● Over 26 million documents
● Time of PHP requests in seconds
Query                MariaDB (8 CPU)   Elasticsearch (4 CPU)
Search by text       14.16 (σ=0.51)    0.80 (σ=0.20)
Last conversations   4.77 (σ=0.45)     0.87 (σ=0.23)
Any questions?
Thank you!