elasticsearch - basics and beyond

CF Software Package

Ernesto Reig, Damian McDonald


Agenda

Introduction
• Elasticsearch definition and key points
• Inverted indexes

Cluster configuration and architecture
• Shards and replica
• Memory
• SSD Disks
• Logs
• Cluster topology

Modeling the data
• Mapping
• Analysis
• Handling relationships

JVM and Cluster monitoring

Introduction

Introduction (1): Elasticsearch definition and key points

Elasticsearch is not a NoSQL database.
Elasticsearch is not a search engine (it uses Apache Lucene for that).
Elasticsearch is a server used to search & analyze data in real time.
• It is distributed, scalable and highly available.
• It is meant for real-time search and analytics capabilities.
• It comes with a sophisticated RESTful API.

3 key points in Elasticsearch:
• Proper cluster configuration and architecture
• Proper Data Mappings
• Proper JVM and cluster monitoring

Elasticsearch is fragile, delicate, sensitive, frail and tricky

“With great power comes great responsibility” Benjamin Parker

Introduction (2): Apache Lucene Inverted indexes

1. Spiderman is my favourite hero
2. Batman is a hero
3. Ernesto is a hero better than Spiderman and Batman

Term        Count   Docs
Spiderman   2       1, 3
is          3       1, 2, 3
my          1       1
favourite   1       1
hero        3       1, 2, 3
Batman      2       2, 3
a           2       2, 3
Ernesto     1       3
better      1       3
than        1       3
and         1       3
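The table above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, assuming plain whitespace tokenization and no lowercasing; it is not how Lucene stores its index internally. The names `build_inverted_index` and `search` are illustrative.

```python
from collections import defaultdict

# The three example documents from the slide, keyed by document id.
DOCS = {
    1: "Spiderman is my favourite hero",
    2: "Batman is a hero",
    3: "Ernesto is a hero better than Spiderman and Batman",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # naive whitespace tokenizer (assumption)
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, *terms):
    """Return ids of the documents that contain every given term (AND)."""
    sets = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


INDEX = build_inverted_index(DOCS)
```

Looking up a term is then a dictionary access: `INDEX["Spiderman"]` gives `[1, 3]`, and `search(INDEX, "Batman", "Spiderman")` gives `[3]`, without scanning the documents themselves.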

Cluster configuration and architecture

Configuration (1): Shards and Replica

• Shard: Apache Lucene Index
• Replica: copy of a shard
• Elasticsearch Index: 1 or more shards

• Question 1: How many shards do we need? And how many replicas?
• Question 2: Does it make sense to have one shard and its corresponding replica in the same node?
• Question 3: Is it useful having a 1-node cluster with "number_of_replicas": 1?
• General rule:
  – Max number of nodes = number of shards * (number of replicas + 1)
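The general rule is simple arithmetic, sketched below. The second function is an illustration of Question 3 and relies on the fact that Elasticsearch does not place a replica on the same node as its primary; the function names are hypothetical.

```python
def max_useful_nodes(number_of_shards, number_of_replicas):
    """Rule from the slide: shards * (replicas + 1)."""
    return number_of_shards * (number_of_replicas + 1)


def unassigned_copies(nodes, number_of_shards, number_of_replicas):
    """Shard copies that cannot be placed when there are too few nodes.

    Every shard has (replicas + 1) copies, and two copies of the same
    shard never share a node.
    """
    copies_per_shard = number_of_replicas + 1
    placeable_per_shard = min(nodes, copies_per_shard)
    return number_of_shards * (copies_per_shard - placeable_per_shard)
```

For example, an index with 5 shards and 1 replica can use up to `max_useful_nodes(5, 1) == 10` nodes, while on a 1-node cluster `unassigned_copies(1, 5, 1) == 5`: all five replicas stay unassigned.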

Configuration (2)

• Dedicated memory (JVM heap) should not be more than 50% of the total memory available.
  – Example for a machine with 16 GB of RAM:
    • ./bin/elasticsearch -Xmx8g -Xms8g
    • export ES_HEAP_SIZE=8g
  – Xms (initial heap) and Xmx (maximum heap) should be the same
• Do not give more than 32 GB!
  – (http://www.elastic.co/guide/en/elasticsearch/guide/master/heap-sizing.html#compressed_oops)
• Enable mlockall to avoid memory swapping:
  – bootstrap.mlockall: true
• Use SSD disks
• Change logs path:
  – path.logs: /var/log/elasticsearch
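The heap sizing rules can be written down as a small helper. This is a sketch under stated assumptions: the cap of 31 GB is an assumed safety margin below the 32 GB limit mentioned above, and the function names are hypothetical.

```python
def recommended_heap_gb(total_ram_gb, max_heap_gb=31):
    """Half of the RAM, capped below 32 GB (31 is an assumed margin)."""
    return max(1, min(total_ram_gb // 2, max_heap_gb))


def heap_settings(total_ram_gb):
    """Render the two ways of setting the heap shown on the slide."""
    heap = recommended_heap_gb(total_ram_gb)
    return {
        # Same value for -Xms and -Xmx, as the slide recommends.
        "jvm_flags": f"-Xms{heap}g -Xmx{heap}g",
        "env": f"ES_HEAP_SIZE={heap}g",
    }
```

With 16 GB of RAM, `heap_settings(16)` yields `-Xms8g -Xmx8g`, matching the slide's example; with 128 GB the heap stays at 31 GB and the rest is left to the operating system's file cache.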


Configuration (3): cluster topology (1)

• A well-designed topology helps the cluster to:
  – Increase search speed
  – Reduce CPU consumption
  – Reduce memory consumption
  – Accept more concurrent requests per second
  – Reduce the probability of split brain
  – Reduce the probability of other errors in general
  – Reduce hardware costs

• Data nodes and 2 types of non-data nodes:
  – data nodes
    • http.enabled: false
    • node.data: true
    • node.master: false
  – dedicated master nodes
    • http.enabled: false
    • node.data: false
    • node.master: true
  – client nodes (smart load balancers)
    • http.enabled: true
    • node.data: false
    • node.master: false
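The three node roles differ only in three settings, so they can be kept in one table and rendered per machine. A minimal sketch: the setting names and values are the ones listed above, while `NODE_ROLES` and `render_settings` are hypothetical helper names.

```python
# Settings per node role, exactly as listed on the slide.
NODE_ROLES = {
    "data": {"http.enabled": False, "node.data": True, "node.master": False},
    "master": {"http.enabled": False, "node.data": False, "node.master": True},
    "client": {"http.enabled": True, "node.data": False, "node.master": False},
}


def render_settings(role):
    """Return the elasticsearch.yml lines for the given role."""
    settings = NODE_ROLES[role]
    return "\n".join(
        f"{key}: {str(value).lower()}" for key, value in settings.items()
    )
```

Calling `render_settings("client")` produces the three lines for a client node, ready to be appended to that machine's `elasticsearch.yml`.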

Configuration (4): cluster topology (2)

With this configuration we can use machines with a different hardware configuration for every type of node. This way we can save a lot of money on hardware!

Example of cluster topology with 2 HTTP nodes, 2 master nodes and 1 to X data nodes

Modeling the data

Modeling the data (1): Mapping

• Mapping is the process of defining how a document should be mapped to the Search Engine
  – Default Dynamic Mapping
• An index may store documents of different "mapping types"
• Mapping types are a way to divide the documents in an index into logical groups. Think of them as tables in a database
• Components:
  – Fields: _id, _type, _source, _all, _parent, _index, _size, …
  – Types: the datatype for each field in a document (e.g. strings, numbers, objects, etc.)
    • Core Types: string, integer/long, float/double, boolean, and null
    • Array
    • Object
    • Nested
    • IP
    • Geo Point
    • Geo Shape
    • Attachment
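An explicit mapping is just a JSON body sent when the index is created. The sketch below builds one as a Python dictionary; it assumes the older mapping syntax that matches these slides (mapping types and the `string` datatype), and the `blogpost` type and its field names are illustrative only.

```python
import json


def blogpost_mapping():
    """Build an explicit mapping body for a hypothetical 'blogpost' type."""
    return {
        "mappings": {
            "blogpost": {
                "properties": {
                    # Analyzed full-text field.
                    "title": {"type": "string"},
                    # Exact-value field: stored as a single term.
                    "tags": {"type": "string", "index": "not_analyzed"},
                    "stars": {"type": "integer"},
                    "date": {"type": "date"},
                }
            }
        }
    }


MAPPING_JSON = json.dumps(blogpost_mapping(), indent=2)
```

The resulting `MAPPING_JSON` string would be the body of a `PUT /my_index` request. Fields that are not listed fall back to the default dynamic mapping.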

Modeling the data (2): Analysis

• Analysis is a process that consists of the following:
  – First, tokenizing a block of text into individual terms suitable for use in an inverted index,
  – Then normalizing these terms into a standard form to improve their "searchability," or recall
• This job is performed by analyzers. An analyzer is really just a wrapper that combines three functions into a single package:
  – 0 or more Character filters
  – 1 Tokenizer
  – 0 or more Token filters
• Analysis is performed to both:
  – break indexed (analyzed) fields when a document is indexed
  – process query strings
• Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suitable for different purposes.

Modeling the data (3): Analysis steps example

Original sentence: Batman & Robin aren´t my favourite heroes

1st) Character filter: Batman and Robin aren´t my favourite heroes

2nd) Tokenizer: Batman | and | Robin | aren´t | my | favourite | heroes

3rd) Token Filter: batman | - | - | robin | aren | my | favourite | heroes

Indexed:
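The three analysis steps can be imitated with a toy pipeline. This is a sketch only, not one of Elasticsearch's built-in analyzers: the `&` replacement, the letters-only tokenizer (which, unlike the slide, splits "aren´t" into "aren" and "t") and the tiny stopword list are all assumptions made for illustration.

```python
import re

# Assumed stopword list, just large enough for this example.
STOPWORDS = {"and", "t"}


def char_filter(text):
    """Character filter: rewrite characters before tokenizing."""
    return text.replace("&", "and")


def tokenize(text):
    """Tokenizer: split the text into runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)


def token_filters(tokens):
    """Token filters: lowercase every token, then drop stopwords."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOPWORDS]


def analyze(text):
    """Run the three stages in order, as an analyzer does."""
    return token_filters(tokenize(char_filter(text)))
```

Running `analyze("Batman & Robin aren´t my favourite heroes")` returns `['batman', 'robin', 'aren', 'my', 'favourite', 'heroes']`. The same function would be applied to a query string, which is why a search for "BATMAN" can still match.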

Modeling the data (4): Handling relationships

Handling relationships between entities is not as obvious as it is with a dedicated relational store. The golden rule of a relational database (normalize your data) does not apply to Elasticsearch.

Four common techniques are used to manage relational data in Elasticsearch:
• Application-side joins
• Data denormalization
• Nested objects
• Parent/child relationships

Modeling the data (5): Handling relationships – Application-side joins

We can (partly) emulate a relational database by implementing joins in our application:

PUT /my_index/user/1
{
  "name": "John Smith",
  "email": "[email protected]",
  "dob": "1970/10/24"
}

PUT /my_index/blogpost/2
{
  "title": "Relationships",
  "body": "It's complicated...",
  "user": 1
}

Problem: This approach is only suitable when the first entity (the user in this example) has a small number of documents and, preferably, they seldom change.
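An application-side join means two lookups: first find the matching users, then find the blog posts that reference their ids. The sketch below imitates this with in-memory dictionaries standing in for the two document types; in a real application each function would be a search request. All names are hypothetical.

```python
# In-memory stand-ins for the "user" and "blogpost" document types.
USERS = {1: {"name": "John Smith", "dob": "1970/10/24"}}
BLOGPOSTS = {
    2: {"title": "Relationships", "body": "It's complicated...", "user": 1},
}


def find_user_ids(users, name):
    """Step 1: look up the ids of the users with the given name."""
    return [user_id for user_id, user in users.items() if user["name"] == name]


def find_posts_by_user_ids(posts, user_ids):
    """Step 2: fetch the blog posts that reference any of those ids."""
    wanted = set(user_ids)
    return [post for post in posts.values() if post["user"] in wanted]


def posts_by_user_name(users, posts, name):
    """The join itself, done in the application."""
    return find_posts_by_user_ids(posts, find_user_ids(users, name))
```

The list of ids returned by step 1 has to be passed into step 2, which is why this only works well when the first entity matches a small number of documents.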

Modeling the data (6): Handling relationships – Data denormalization

Having redundant copies of data in each document that requires access to it removes the need for joins:

PUT /my_index/user/1
{
  "name": "John Smith",
  "email": "[email protected]",
  "dob": "1970/10/24"
}

PUT /my_index/blogpost/2
{
  "title": "Relationships",
  "body": "It's complicated...",
  "user": {
    "id": 1,
    "name": "John Smith"
  }
}

Problem: if we want to update the name, or remove a user object, we also have to reindex the whole blogpost document.

Modeling the data (7): Handling relationships – Nested objects

Given the fact that creating, deleting, and updating a single document in Elasticsearch is atomic, it makes sense to store closely related entities within the same document:

PUT /my_index/blogpost/1
{
  "title": "Nest eggs",
  "body": "Making your money work...",
  "tags": [ "cash", "shares" ],
  "comments": [
    {
      "name": "John Smith",
      "comment": "Great article",
      "age": 28,
      "stars": 4,
      "date": "2014-09-01"
    },
    {
      "name": "Alice White",
      "comment": "More like this please",
      "age": 31,
      "stars": 5,
      "date": "2014-10-22"
    }
  ]
}

Problem: As with denormalization, to update, add, or remove a nested object, we have to reindex the whole blogpost document.
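Why the comments need the nested type rather than a plain array of objects: a plain array is flattened into one list of values per field, so the link between a name and an age inside the same comment is lost. The sketch below imitates both behaviours in memory; it compares exact values and ignores analysis, and all function names are hypothetical.

```python
def flatten(doc, field):
    """Imitate a plain object array: one list of values per inner field."""
    flat = {}
    for obj in doc[field]:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def match_flattened(doc, field, **criteria):
    """Each criterion may be satisfied by a different inner object."""
    flat = flatten(doc, field)
    return all(
        value in flat.get(f"{field}.{key}", [])
        for key, value in criteria.items()
    )


def match_nested(doc, field, **criteria):
    """All criteria must be satisfied by the same inner object."""
    return any(
        all(obj.get(key) == value for key, value in criteria.items())
        for obj in doc[field]
    )
```

With the two comments from the example above, `match_flattened(post, "comments", name="Alice White", age=28)` is true even though Alice is 31, while `match_nested` with the same arguments is false.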

Modeling the data (8): Handling relationships – Parent/child relationship

The parent-child functionality allows you to associate one document type with another, in a one-to-many relationship (one parent to many children). Advantages:
• The parent document can be updated without reindexing the children.
• Child documents can be added, changed, or deleted without affecting either the parent or other children.
• Child documents can be returned as the results of a search request.

Parent-child mapping (needed before a child document can be indexed):

PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": { "type": "branch" }
    }
  }
}

Find children by parent:

GET /company/employee/_search
{
  "query": {
    "has_parent": {
      "type": "branch",
      "query": {
        "match": { "country": "UK" }
      }
    }
  }
}

Find parents by children:

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "term": { "name": "John" }
      }
    }
  }
}
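The two query directions can be imitated in memory to show what each one returns. This is a sketch only: the branch and employee records are made-up sample data, and `has_child` and `has_parent` here are plain Python functions named after the queries, not client library calls.

```python
# Made-up sample data: each employee points at its parent branch id.
BRANCHES = {"london": {"country": "UK"}, "paris": {"country": "France"}}
EMPLOYEES = [
    {"name": "John", "parent": "london"},
    {"name": "Marie", "parent": "paris"},
]


def has_child(branches, employees, predicate):
    """Return the parents that have at least one matching child."""
    parent_ids = {emp["parent"] for emp in employees if predicate(emp)}
    return {bid: b for bid, b in branches.items() if bid in parent_ids}


def has_parent(branches, employees, predicate):
    """Return the children whose parent matches."""
    return [emp for emp in employees if predicate(branches[emp["parent"]])]
```

`has_child(BRANCHES, EMPLOYEES, lambda e: e["name"] == "John")` returns the London branch, and `has_parent(BRANCHES, EMPLOYEES, lambda b: b["country"] == "UK")` returns John's employee record. Parents and children remain separate documents, so either side can be updated without touching the other.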


JVM and Cluster monitoring

• Server CPU and disk usage
• Elasticsearch logs
• Elasticsearch plugins:
  – Marvel
  – Bigdesk
  – Watcher
• Watch stats (http://localhost:9200/_stats)
• JVM
  – Jstat: jstat -gcutil es_pid 2000 1000 (get the ES pid with jps)
  – Visual JVM plugin
  – Memory dump – jmap
• Hot threads API

• Before going to production: Apache JMeter tests!
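The output of `jstat -gcutil` is a header line followed by one row of percentages and counters per sample, so it is easy to post-process. The sketch below parses such text and flags samples where the old generation (column `O`) is fuller than a threshold; the 75% default is an assumption, and the function names are hypothetical.

```python
def _to_number(value):
    """Convert a jstat cell to float; non-numeric cells become None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gcutil(output):
    """Turn jstat -gcutil text into a list of {column: value} dicts."""
    lines = [line.split() for line in output.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(_to_number, row))) for row in rows]


def old_gen_alerts(samples, threshold=75.0):
    """Return the samples whose old generation usage exceeds the threshold."""
    return [s for s in samples if s.get("O") is not None and s["O"] > threshold]
```

The captured text would come from running the `jstat` command above and saving its output. A steadily growing `O` column together with a rising `FGC` (full GC count) is usually the sign to look at the heap more closely, for example with `jmap`.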

Thank You


Data & Analytics