elasticsearch for data engineers

ElasticsearchCrash Course for Data Engineers

Duy Do (@duydo)

About● A Father, A Husband, A Software Engineer● Founder of Vietnamese Elasticsearch Community● Author of Vietnamese Elasticsearch Analysis Plugin● Technical Consultant at Sentifi AG● Co-Founder at Krom● Follow me @duydo

https://www.facebook.com/groups/elasticsearchvn/

https://github.com/duydo/elasticsearch-analysis-vietnamese

https://twitter.com/duydo

https://twitter.com/duydo

Elasticsearch is Everywhere

What is Elasticsearch?

Elasticsearch is a distributed search and analytics engine, designed for horizontal scalability with easy management.

Basic Terms

● Cluster is a collection of nodes.● Node is a single server, part of a

cluster.● Index is a collection of shards ~

database.● Shard is a collection of

documents.● Type is a category/partition of an

index ~ table in database.● Document is a Json object ~

record in database.

Distributed & Scalable

Shards & Replicas

One node, One shard

Node 1

employees

P0

PUT /employees{

“settings”: {“number_of_shards”: 1,“number_of_replicas”: 0

}}

Two nodes, One shard

Node 1

employees

P0

PUT /employees{


}}

Node 2

One node, Two shards

Node 1

employees

P0

PUT /employees{


}}

P1

Two Nodes, Two Shards

Node 1

employees

P0

PUT /employees{


}}

Node 2

employees

P1P1

Two nodes, Two shards, One replica of each shard

Node 1

employees

P0

PUT /employees{


}}

R1

Node 2

employees

P1 R0

Index Management

Create IndexPUT /employees{

“settings”: {...},“mappings”: {

“type_one”: {...},“type_two”: {...}

},“aliases”: {

“alias_one”: {...},“alias_two”: {...}

}}

Index SettingsPUT /employees/_settings{

“number_of_replicas”: 1}

Index MappingsPUT /employees/_mappings{

“employee”: {“properties”: {

“name”: {“type”: “string”},“gender”: {“type”: “string”, “index”: “not_analyzed”},“email”: {“type”: “string”, “index”: “not_analyzed”},“dob”: {“type”: “date”},“country”: {“type”: “string”, “index”: “not_analyzed”},“salary”: {“type”: “double”},

}}

}

Delete IndexDELETE /employees

Put Data In, Get Data Out

Index a Document with ID

PUT /employees/employee/1{

“name”: “Duy Do”,“email”: “[email protected]”,“dob”: “1984-06-20”,“country”: “VN”“gender”: “male”,“salary”: 100.0

}

Index a Document without ID

POST /employees/employee/{

“name”: “Duy Do”,“email”: “[email protected]”,“dob”: “1984-06-20”,“country”: “VN”“gender”: “male”,“salary”: 100.0

}

Retrieve a Document

GET /employees/employee/1

Update a Document

POST /employees/employee/1/_update{

“doc”:{“salary”: 500.0

}}

Delete a Document

DELETE /employees/employee/1

Searching

Structured SearchDate, Times, Numbers, Text

● Finding Exact Values● Finding Multiple Exact Values● Ranges● Working with Null Values● Combining Filters

Finding Exact ValuesGET /employees/employee/_search{

“query”: {“term”: {

“country”: “VN”}

}}

SQL: SELECT * FROM employee WHERE country = ‘VN’;

Finding Multiple Exact ValuesGET /employees/employee/_search{

“query”: {“terms”: {

“country”: [“VN”, “US”]}

}}

SQL: SELECT * FROM employee WHERE country = ‘VN’ OR country = ‘US’;

RangesGET /employees/employee/_search{

“query”: {“range”: {

“dob”: {“gt”: “1984-01-01”, “lt”: “2000-01-01”} }

}}

SQL: SELECT * FROM employee WHERE dob BETWEENS ‘1984-01-01’ AND ‘2000-01-01’;

Working with Null valuesGET /employees/employee/_search{

“query”: {“filtered”: {

“filter”: {“exists”: {“field”: “email”}

} }

}}

SELECT * FROM employee WHERE email IS NOT NULL;

Working with Null ValuesGET /employees/employee/_search{


“filter”: {“missing”: {“field”: “email”}

} }

}}

SELECT * FROM employee WHERE email IS NULL;

Combining FiltersGET /employees/employee/_search{


“filter”: {“bool”: {

“must”:[{“exists”: {“field”: “email”}}],“must_not”:[{“term”: {“gender”: “female”}}],“should”:[{“terms”: {“country”: [“VN”, “US”]}}]

}}

}}

}

Combining FiltersSQL:

SELECT * FROM employeeWHERE email IS NOT NULL

AND gender != ‘female’AND (country = ‘VN’ OR country = ‘US’);

More Queries

● Prefix● Wildcard● Regex● Fuzzy● Type● Ids● ...

Full-Text SearchRelevance, Analysis

● Match Query● Combining Queries● Boosting Query Clauses

Match Query - Single WordGET /employees/employee/_search{

“query”: {“match”: {

“name”: {“query”: “Duy”

} }

}}

Match Query - Multi WordsGET /employees/employee/_search{

“query”: {“match”: {

“name”: {“query”: “Duy Do”,“operator”: “and”

} }

}}

Combining QueriesGET /employees/employee/_search{

“query”: {“bool”: {

“must”:[{“match”: {“name”: “Do”}}],“must_not”:[{“term”: {“gender”: “female”}}],“should”:[{“terms”: {“country”: [“VN”, “US”]}}]

}

}}

Boosting Query ClausesGET /employees/employee/_search{

“query”: {“bool”: {

“must”:[{“term”: {“gender”: “female”}}], # default boost 1 “should”:[

{“term”: {“country”: {“query”:“VN”, “boost”:3}}} # the most important

{“term”: {“country”: {“query”:“US”, “boost”:2}}} # important than #1 but not as important as #2],

}

}}

More Queries● Multi Match● Common Terms● Query Strings● ...

Analytics

AggregationsAnalyze & Summarize

● How many needles in the haystack?

● What is the average length of the needles?

● What is the median length of the needles, broken down by manufacturer?

● How many needles are added to the haystacks each month?

● What are the most popular needle manufacturers?

● ...

Buckets & Metrics

SELECT COUNT(country) # a metricFROM employeeGROUP BY country # a bucket

GET /employees/employee/_search{

“aggs”: {“by_country”: {

“terms”: {“field”: “country”}}

}}

Bucket is a collection of documents that meet certain criteria.

Metric is simple mathematical operations such as: min, max, mean, sum and avg.

CombinationBuckets & Metrics

● Partitions employees by country (bucket)

● Then partitions each country bucket by gender (bucket)

● Finally calculate the average salary for each gender bucket (metric)

Combination QueryGET /employees/employee/_search{

“aggs”: {“by_country”: { “terms”: {“field”: “country”},

“aggs”: {“by_gender”: { “terms”: {“field”: “gender”},

“aggs”: {“avg_salary”: {“avg”: “field”: “salary”}

}}

}}

}}

More Aggregations● Histogram● Date Histogram● Date Range● Filter/Filters● Missing● Geo Distance● Nested● ...

Best Practices

Indexing

● Use bulk indexing APIs.● Tune your bulk size 5-10MB.● Partitions your time series data

by time period (monthly, weekly, daily).

● Use aliases for your indices.● Turn off refresh, replicas while

indexing. Turn on once it’s done● Multiple shards for parallel

indexing.● Multiple replicas for parallel

reading.

Mapping● Disable _all field● Keep _source field, do not store

any field. ● Use not_analyzed if possible

Query

● Use filters instead of queries if possible.

● Consider orders and scope of your filters.

● Do not use string query.● Do not load too many results

with single query, use scroll API instead.

Kibana for Discovery, Visualization

Sense for Query

Marvel for Monitoring

elasticsearch for data engineers

Technology