Text search with Elasticsearch on AWS
Łukasz Przybyłek, Tidio

Post on 13-Jan-2017


Page 1: Text search with Elasticsearch on AWS

Text search with Elasticsearch on AWS
Łukasz Przybyłek
Tidio

Page 2: Text search with Elasticsearch on AWS

What’s Elasticsearch?

● Search & analytics engine
● Fast
● Scalable
● Distributed
● Full-text search capabilities
● (Near) real-time indexing
● Document oriented
● Schema free

Page 3: Text search with Elasticsearch on AWS

When do I need it?

● You need a faster search mechanism
● You need to search through a large amount of data
● You need powerful full-text queries

Page 4: Text search with Elasticsearch on AWS

How does it work?

Input document → Analyzer → Terms → Index
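The pipeline above can be sketched in a few lines of Python. This is a toy illustration, not Elasticsearch's implementation: an Elasticsearch analyzer is composed of character filters, a tokenizer and token filters, and the function names below (`strip_html`, `tokenize`, `lowercase`, `analyze`) are made up for this example.

```python
import re

def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: normalize case so that 'Quick' and 'quick' become the same term."""
    return [token.lower() for token in tokens]

def analyze(text):
    """Run an input document through the whole chain and return its terms."""
    return lowercase(tokenize(strip_html(text)))

terms = analyze("<p>The quick brown fox</p>")
print(terms)
```

The resulting terms are what ends up in the index; the same analysis is normally applied to the query text so that both sides are compared in the same normalized form.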

Page 5: Text search with Elasticsearch on AWS

Inverted Index

Id | Content
1  | The quick brown fox jumped over the lazy dog
2  | Quick brown foxes leap over lazy dogs in summer

↓ analysis

Term   | Doc_1 | Doc_2
brown  | X     | X
dog    | X     | X
fox    | X     | X
in     |       | X
jump   | X     | X
lazy   | X     | X
over   | X     | X
quick  | X     | X
summer |       | X
the    | X     |
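As a rough sketch of the idea, the snippet below builds an inverted index from the two example documents and answers a query by intersecting posting sets. The `NORMALIZE` table is a hand-written stand-in for the stemming and synonym filters a real analyzer would apply, and all function names are illustrative assumptions.

```python
import re
from collections import defaultdict

# Toy stand-in for stemming and synonym token filters (assumption for this example).
NORMALIZE = {"jumped": "jump", "leap": "jump", "foxes": "fox", "dogs": "dog"}

def analyze(text):
    """Lowercase, split into words, then map each word to its normalized term."""
    tokens = re.findall(r"\w+", text.lower())
    return [NORMALIZE.get(token, token) for token in tokens]

def build_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in analyze(content):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_index(docs)
print(sorted(index))
print(search(index, "jumped fox"))
print(search(index, "summer dogs"))
```

Because the query goes through the same analysis as the documents, a search for "jumped" also finds the document that only contains "leap".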

Page 6: Text search with Elasticsearch on AWS

Logical data structures

● Elasticsearch (cluster) contains indexes
● Index contains types
● Type contains documents
● Mappings are assigned to types
● Index aliases (optional) can point to indices and modify queries (e.g. add a filter)
● There are no classic SQL-like relationships (!)

Page 7: Text search with Elasticsearch on AWS

Logical data structures

Cluster
  Index, Index, Index
    Type, Type
      Mapping
      Document, Document

Page 8: Text search with Elasticsearch on AWS

Physical data structures

● Cluster contains nodes
● Index is stored in one or more shards (a single shard is a Lucene index instance)
● A single node can hold shards of different indexes
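How a document ends up in a particular shard can be sketched as follows. Elasticsearch picks the shard by hashing a routing value (the document id by default) modulo the number of primary shards. It uses Murmur3 for the hash; the sketch substitutes `zlib.crc32` only because it ships with Python, and `pick_shard` is a name made up for this example.

```python
import zlib

def pick_shard(routing, number_of_primary_shards):
    """Choose a shard for a routing value (CRC32 is a stand-in for Murmur3)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

doc_ids = [f"message-{i}" for i in range(1000)]
with_5_shards = [pick_shard(doc_id, 5) for doc_id in doc_ids]
with_6_shards = [pick_shard(doc_id, 6) for doc_id in doc_ids]

# Count the documents whose shard would change if the shard count changed.
moved = sum(a != b for a, b in zip(with_5_shards, with_6_shards))
print(f"{moved} of {len(doc_ids)} documents would land on a different shard")
```

Since changing the shard count would relocate most documents, the number of primary shards is chosen when the index is created.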

Page 9: Text search with Elasticsearch on AWS

How to deal with lack of joins?

● Denormalization
● Client-side joins
● Parent-child relationships

Page 10: Text search with Elasticsearch on AWS

Elasticsearch in Tidio

● Tidio Chat: a business communication tool where business owners (operators) communicate with their customers (visitors)
● www.tidiochat.com
● ES is used instead of MariaDB to:
    ○ Fetch the last conversations in a project
    ○ Search by message content and visitor email in a project's conversation history

Page 11: Text search with Elasticsearch on AWS

Relations in Tidio Chat

Message: id, visitor_id, operator_id, content, time
Project: public_key
Visitor: id, project_public_key, name, email
Operator: id, project_public_key

Page 12: Text search with Elasticsearch on AWS

Message document schema

● Project’s public key added to document
● Search by email performed in MariaDB
● Time mapped as date explicitly
● Client-side join with Visitor

Message: id, visitor_id, operator_id, project_public_key, content, time
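The two techniques used for the Message document, denormalization and a client-side join, might look roughly like this in application code. The helper names and the sample data are hypothetical; only the field names come from the schema above.

```python
def to_message_document(message, visitor):
    """Denormalize: copy the project's public key from the visitor onto the message."""
    doc = dict(message)
    doc["project_public_key"] = visitor["project_public_key"]
    return doc

def join_with_visitors(hits, visitors_by_id):
    """Client-side join: attach visitor rows fetched separately (e.g. from MariaDB)."""
    return [dict(hit, visitor=visitors_by_id.get(hit["visitor_id"])) for hit in hits]

# Hypothetical sample data.
visitors_by_id = {
    7: {"id": 7, "project_public_key": "pk_demo", "name": "Anna", "email": "anna@example.com"},
}
message = {"id": 1, "visitor_id": 7, "operator_id": 3,
           "content": "Hello!", "time": "2017-01-13T10:00:00Z"}

doc = to_message_document(message, visitors_by_id[7])   # what gets indexed
results = join_with_visitors([doc], visitors_by_id)      # what the application displays
print(results[0]["visitor"]["email"])
```

The denormalized key lets a query filter messages by project without a join, while visitor details stay in the relational database and are merged in after the search returns.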

Page 13: Text search with Elasticsearch on AWS

Design decisions

● Questions
    a. What indexes should be created?
    b. What types should be created?
    c. How should shards be distributed among nodes and indexes?
● Things to consider
    a. Searching in a smaller dataset usually means faster search results
    b. An index with a small number of shards does not scale efficiently to new nodes
    c. Types are used mainly to assign mappings; they are not separate “search entities”, so there is no direct performance boost from using many types
    d. An index doesn’t need to represent a domain entity

Page 14: Text search with Elasticsearch on AWS

Ideas?

Index for each project, one type inside index

● 250k projects = 250k indexes
● Adding new index is slow
● Large overhead associated with shards and indices count

Page 15: Text search with Elasticsearch on AWS

Ideas?

One index and separate type for each project

● Large index
● Nodes scale up only to the number of shards in a particular index (default 5, no auto index splitting)
● Every query would go through all shards and filter by project_public_key (large amount of data to search in)

Page 16: Text search with Elasticsearch on AWS

Ideas?

Group projects and create an index for each group

● Limited amount of data to search in
● Reasonable number of shards, which can still scale up to many nodes
● Possibility to add an alias for each project and search as if it were a separate index
● Projects may be grouped by language and use specific analyzers
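One possible way to implement the grouping is sketched below: a project is assigned to a group index by hashing its public key, and a filtered alias makes the project searchable as if it had its own index. The index naming scheme, `GROUP_COUNT` and the helper names are assumptions for illustration; the dictionary returned by `alias_action` follows the request body of the `POST /_aliases` endpoint.

```python
import json
import zlib

GROUP_COUNT = 16  # hypothetical number of group indexes per language

def group_index_name(project_public_key, language="en"):
    """Assign a project to one of the group indexes for its language."""
    group = zlib.crc32(project_public_key.encode("utf-8")) % GROUP_COUNT
    return f"messages_{language}_{group:02d}"

def alias_action(project_public_key, language="en"):
    """Body for POST /_aliases: an alias that filters the group index by project."""
    return {
        "actions": [{
            "add": {
                "index": group_index_name(project_public_key, language),
                "alias": f"project_{project_public_key}",
                "filter": {"term": {"project_public_key": project_public_key}},
            }
        }]
    }

print(json.dumps(alias_action("pk_demo"), indent=2))
```

Searching the alias `project_pk_demo` then only returns documents of that project. The term filter assumes that `project_public_key` is indexed as an exact, non-analyzed value.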

Page 17: Text search with Elasticsearch on AWS

Amazon Web Services Elasticsearch cluster

● Quick and easy to install
● Extremely limited configuration options
● Limited query options (scripts disabled)
● Can be used with standard AWS authentication
● There is no AWS SDK that supports ES, so users have to write code that signs requests manually
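Manual signing means implementing AWS Signature Version 4. The sketch below shows the general shape of the computation using only the Python standard library. It is deliberately simplified: it assumes an already URI-encoded path, no query string, and only the `host` and `x-amz-date` headers being signed. The endpoint and credentials are placeholders, and `sign_request` is a made-up helper name.

```python
import hashlib
import hmac
from datetime import datetime, timezone

def _hmac(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def signing_key(secret_key, date_stamp, region, service):
    """Derive the SigV4 signing key from the secret key, date, region and service."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

def sign_request(method, host, path, body, access_key, secret_key, region, now, service="es"):
    """Return the headers that must accompany the signed request."""
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    signed_headers = "host;x-amz-date"
    canonical_headers = f"host:{host}\nx-amz-date:{amz_date}\n"
    # Canonical request: method, path, (empty) query string, headers, signed headers, payload hash.
    canonical_request = "\n".join([
        method, path, "", canonical_headers, signed_headers,
        hashlib.sha256(body).hexdigest(),
    ])
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    key = signing_key(secret_key, date_stamp, region, service)
    signature = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()
    authorization = (f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
                     f"SignedHeaders={signed_headers}, Signature={signature}")
    return {"Host": host, "X-Amz-Date": amz_date, "Authorization": authorization}

# Placeholder endpoint and credentials, fixed timestamp for reproducibility.
headers = sign_request(
    "GET", "search-demo.eu-west-1.es.amazonaws.com", "/_cluster/health", b"",
    "AKIDEXAMPLE", "not-a-real-secret", "eu-west-1",
    datetime(2017, 1, 13, 12, 0, 0, tzinfo=timezone.utc),
)
print(headers["Authorization"])
```

A production implementation also has to canonicalize query parameters and any additional signed headers, which is why plugging a signing handler into an existing client is usually preferable to hand-rolling requests.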

Page 18: Text search with Elasticsearch on AWS

PHP clients for ES

● elasticsearch/elasticsearch
    ○ https://github.com/elastic/elasticsearch-php
    ○ Low level ES client
    ○ One-to-one mapping with REST API
    ○ Pluggable architecture (can use custom request handler and send AWS signed requests)
    ○ Does all things that you don’t want to know about, e.g. discovery of cluster nodes, load balancing, Keep-Alive connections
    ○ Accepts queries in JSON
● ruflin/elastica
    ○ https://github.com/ruflin/Elastica
    ○ High level client
    ○ Classes representing indices/queries/terms, so you do not have to write JSON

Page 19: Text search with Elasticsearch on AWS

Elasticsearch limitations

● Query capabilities are more limited than SQL (e.g. no joins)
● There is no paging support for aggregations

Page 20: Text search with Elasticsearch on AWS

AWS Elasticsearch limitations

● threadpool.bulk.queue_size=50
● No script support
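With a bulk queue of only 50 pending requests per node, sending many small bulk requests in parallel is likely to get some of them rejected. One common way to live with the limit is to send fewer, larger batches one after another. The sketch below only builds the newline-delimited bodies for the `_bulk` endpoint; the batch size, the index and type names, and the `bulk_bodies` helper are assumptions for illustration.

```python
import json

def bulk_bodies(docs, index, doc_type, batch_size=500):
    """Yield newline-delimited bulk request bodies, one per batch of documents."""
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            # Action line followed by the document source line.
            lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc["id"]}}))
            lines.append(json.dumps(doc))
        yield "\n".join(lines) + "\n"  # the bulk body must end with a newline

docs = [{"id": i, "content": f"message {i}"} for i in range(1200)]
bodies = list(bulk_bodies(docs, "messages_en_00", "message"))
print(len(bodies), "bulk requests instead of", len(docs), "single index requests")
```

Each body would be sent with `POST /_bulk`, waiting for the response before sending the next one, and retrying with a delay if the cluster answers with a rejection.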

Page 21: Text search with Elasticsearch on AWS

Indexing performance

● Check your mappings!
● Set fields that do not need full-text search as not analyzed
● Disable the _all field
● Tune your analyzer and index_options (advanced)
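As an illustration of these tips, an index creation body for the message documents might look like the sketch below. It uses the mapping syntax of Elasticsearch 2.x (`string` fields with `"index": "not_analyzed"`), which is an assumption about the version in use; the type name `message`, the numeric id types and the shard count are also assumptions.

```python
import json

index_body = {
    "settings": {"number_of_shards": 5},
    "mappings": {
        "message": {
            # Do not copy every field into the catch-all _all field.
            "_all": {"enabled": False},
            "properties": {
                "id": {"type": "long"},
                "visitor_id": {"type": "long"},
                "operator_id": {"type": "long"},
                # Exact-match key: stored as a single term, never analyzed.
                "project_public_key": {"type": "string", "index": "not_analyzed"},
                # The only field that needs full-text analysis.
                "content": {"type": "string"},
                # Mapped explicitly instead of relying on date detection.
                "time": {"type": "date"},
            },
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The body would be sent when creating the index (`PUT /<index name>`). Only `content` goes through the analyzer, so less work is done per indexed document.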

Page 22: Text search with Elasticsearch on AWS

Search performance

● Unfair comparison
● Over 26 million documents
● Time of PHP requests in seconds

Query \ Service    | MariaDB (8 CPU) | Elasticsearch (4 CPU)
Search by text     | 14.16 (σ=0.51)  | 0.80 (σ=0.20)
Last conversations | 4.77 (σ=0.45)   | 0.87 (σ=0.23)

Page 23: Text search with Elasticsearch on AWS

Any questions?

Page 24: Text search with Elasticsearch on AWS

Thank you!