AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Mahdi Ben Hamida - SignalFx
11/30/2016
DEV307
How to Scale and Operate Elasticsearch on AWS
What to Expect from the Session
• Elasticsearch (ES) usage at SignalFx
• What do we use ES for?
• How ES is deployed on AWS
• Backup/restore of ES on Amazon S3
• Important ES/AWS metrics to monitor; what to alert on
• ES capacity planning
• Zero-downtime re-sharding
• SignalFx metadata storage architecture overview
• Scaling up and zero-downtime re-sharding on AWS
Elasticsearch at SignalFx
ES Usage
Ad-hoc queries Auto-complete Full-text search
Cluster Size
• 4 clusters in production on Amazon EC2
• Biggest cluster
• 54 data nodes, 3 master nodes, 6 client nodes deployed
across 3 AZs
• Over 1.3 billion unique documents
• 10+ TB of data
• 270 shards (primaries + replicas)
• Sustained 75 QPS, 1K index/sec
ES Deployment on AWS
• Dockerized ES 2.3/1.7 clusters. Orchestration done
using MaestroNG
• Biggest cluster
• Data nodes: i2.2xlarge – 16 GB heap (61 GB total)
• Master nodes: m3.large – 2 GB heap (7.5 GB total)
• Client nodes: m3.xlarge – 10 GB heap (15 GB total)
• ES rack awareness to distribute the primary and 2 replicas across 3 Availability Zones
Backup/Restore
• Made easy using the AWS Cloud plugin:

  PUT _snapshot/s3-repo
  {
    "type": "s3",
    "settings": {
      "bucket": "signalfx-es-backups",
      "region": "us-east"
    }
  }
• Incremental backups
• Un-versioned S3 bucket
• VPC S3 endpoint to avoid bandwidth constraints
• Instance profiles for authentication to S3
• Cron job for hourly snapshots and weekly rotation
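The hourly snapshot cron job can be sketched as a small helper that builds the snapshot request. The date-based naming scheme here is a hypothetical choice (the talk does not specify one), but it makes the weekly rotation job easy to implement:

```python
import datetime
import json

def snapshot_request(repo="s3-repo", now=None):
    """Build the REST method, path, and body for an hourly snapshot.

    The "hourly-YYYYMMDD-HH00" naming scheme is an assumption for
    illustration; date-based names let a rotation job find and delete
    snapshots older than a week.
    """
    now = now or datetime.datetime.utcnow()
    name = now.strftime("hourly-%Y%m%d-%H00")
    path = "/_snapshot/{}/{}".format(repo, name)
    # Snapshot all indices; skip any that are temporarily unavailable.
    body = json.dumps({"indices": "_all", "ignore_unavailable": True})
    return "PUT", path, body

method, path, body = snapshot_request(now=datetime.datetime(2016, 11, 30, 9))
```

Because S3 snapshots are incremental, the hourly job only uploads segments that changed since the previous snapshot.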
ES Monitoring & Alerting
Key Performance Metrics
Key Detectors
• High CPU usage, low free disk space
• Sustained high heap usage
• Master nodes availability
• Cluster state (green/yellow/red)
• Unassigned shards
• Thread pool rejections (search, bulk, index are the most
critical)
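A "sustained high heap" detector needs to ignore short GC spikes and fire only when usage stays high. A minimal sketch of that logic (thresholds are illustrative, not SignalFx's actual settings):

```python
def sustained_high(samples, threshold=0.9, min_consecutive=5):
    """Fire only when `min_consecutive` samples in a row exceed
    `threshold`, so a single GC spike doesn't page anyone.
    Threshold and window are hypothetical example values."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False

# A brief spike does not fire; a sustained run does.
spike = [0.95, 0.95, 0.70, 0.95, 0.70, 0.95]
sustained = [0.92] * 5
```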
Always Test your ES Detectors/Alerts
Elasticsearch Capacity Planning
Capacity Factors
• Indexing
• CPU/IO utilization can be considerable
• Merges are CPU/IO intensive. Improved in ES 2.0
• Queries
• CPU load
• Memory load
ES Sharding & Scale-up
[Diagram: an index with primary shards 0P and 1P and replicas 0R and 1R, first packed onto node-1/node-2; adding node-3/node-4 gives the replicas their own nodes; adding node-5/node-6 hosts a second set of replicas 0R and 1R.]
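The reason scale-up eventually requires re-sharding is that Elasticsearch routes each document to `hash(routing) % number_of_primary_shards`, so the primary count is fixed at index creation; adding nodes only spreads the existing shards around. A sketch of that routing rule (using Python's `zlib.crc32` in place of the Murmur3 hash ES actually uses):

```python
import zlib

def shard_for(doc_id, num_primary_shards):
    # ES routes a document to hash(routing) % number_of_primary_shards.
    # crc32 stands in for ES's Murmur3 here, purely for illustration.
    return zlib.crc32(doc_id.encode()) % num_primary_shards

# Changing the primary count reassigns many documents, which is why
# growing from 2 to 4 primaries means re-indexing into a new index.
two = [shard_for("doc-%d" % i, 2) for i in range(100)]
four = [shard_for("doc-%d" % i, 4) for i in range(100)]
```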
Sizing Shards
• Create an index with one shard
• Simulate what you expect your indexing load to be –
measure CPU/IO load, find where it breaks
• Do the same with queries
• Determine disk consumption (average document size)
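The single-shard benchmark numbers can then be turned into a shard count. `shard_count` is a hypothetical helper combining the two limits you measured, taking whichever dimension demands more shards:

```python
import math

def shard_count(expected_index_rate, per_shard_index_limit,
                expected_data_gb, per_shard_disk_gb):
    """Derive a shard count from one-shard benchmark results.

    `per_shard_index_limit` is the docs/sec where the single shard
    broke in your test; `per_shard_disk_gb` is how much data you are
    willing to keep on one shard. Both inputs come from your own
    measurements; the function itself is an illustrative sketch.
    """
    by_rate = math.ceil(expected_index_rate / per_shard_index_limit)
    by_disk = math.ceil(expected_data_gb / per_shard_disk_gb)
    return max(by_rate, by_disk)
```

For example, 1,000 docs/sec against a measured 400 docs/sec per-shard limit needs 3 shards, but 10 TB at 500 GB per shard needs 20, so disk wins.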
Zero-downtime Re-sharding
Why Re-shard?
• Required if you can’t scale up indexing by adding more
nodes
• If the index is read-only, you could implement a simpler
approach using aliases
• If the index is being written to, it’s more complicated
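For the read-only case, the alias approach amounts to building the new index offline and then atomically repointing readers with the `_aliases` API. A sketch of that request body (the index names follow the `myindex_v1`/`myindex_v2` convention used later in the talk):

```python
def alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases: both actions execute in a single
    cluster-state update, so readers never observe an empty alias.
    Index/alias names here are illustrative."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_swap("myindex", "myindex_v1", "myindex_v2")
```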
service-A
metabase-client
mb-
server-1mb-
server-1metabase-1index-topic
write-topic
(1) enqueue write
(2) dequeue write
(3) write to C*
(4) enqueue index
(7) index document
(5) dequeue index
(6) read from C*
SignalFx’s Metadata Storage Architecture
Index Re-sharding Process
• Pre-requisites
• Phase 1: create target index
• Phase 2: bulk re-indexing
• Phase 3: double writing & change re-conciliation
• Phase 4: testing new index
• Phase 5: complete re-sharding process
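The phases above can be condensed into a sketch of the indexer's state machine, mirroring the generation numbers shown on the following slides (a hypothetical in-memory class; the real indexer persists this state):

```python
class IndexerState:
    """Tracks the three fields from the slides: a generation number,
    the current write index, and an optional extra index used while
    double writing. Method names are illustrative."""

    def __init__(self):
        self.generation = 42
        self.current = "myindex_v1"
        self.extra = None

    def targets(self):
        # Every write goes to current, and to extra while double writing.
        return [i for i in (self.current, self.extra) if i]

    def start_bulk_reindex(self):        # Phase 2
        self.generation += 1             # older generations get bulk-copied

    def enable_double_write(self, new):  # Phase 3 (a)
        self.extra = new
        self.generation += 1

    def complete(self, ):                # Phase 5
        self.current, self.extra = self.extra, None
        self.generation += 1

state = IndexerState()
state.start_bulk_reindex()               # generation 43
state.enable_double_write("myindex_v2")  # generation 44, writes go to both
state.complete()                         # generation 45, myindex_v2 only
```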
Pre-requisite 1: readers query from an alias
[Diagram: three readers query the alias myindex, which points to myindex_v1.]
Pre-requisite 2: indexing state + generation number
[Diagram: the indexer writes to myindex_v1 and tracks state: generation 42, extra <null>, current myindex_v1.]
Phase 1: create new index with updated mappings
[Diagram: myindex_v2 is created alongside myindex_v1; indexer state unchanged: generation 42, extra <null>, current myindex_v1.]
Phase 2: increment generation, then start bulk re-indexing of older generations
[Diagram: documents with _generation <= 42 are bulk-copied from myindex_v1 into myindex_v2; indexer state: generation 43, extra <null>, current myindex_v1.]
During this step, documents may get added/updated (or deleted*)
[Diagram: while the bulk copy of _generation <= 42 documents runs, newly created or updated documents land in myindex_v1 at generation 43; indexer state: generation 43, extra <null>, current myindex_v1.]
Index state at the end of the bulk indexing
[Diagram: myindex_v2 holds a copy of the older documents, but the generation-43 writes exist only in myindex_v1; indexer state: generation 43, extra <null>, current myindex_v1.]
Phase 3 – (a): enable double writing & bump generation
[Diagram: the indexer now writes to both myindex_v1 and myindex_v2; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 3 – (b): re-index documents at generation 43
[Diagram: documents written at generation 43 during the bulk copy are re-indexed into myindex_v2, while new generation-44 writes go to both indices; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 3 – (c): re-index documents at generation 43 (continued)
[Diagram, animation frames: the generation-43 backlog shrinks as it is copied into myindex_v2, while generation-44 double writes accumulate in both indices; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 3 – (e): perfect sync of both indices
[Diagram: myindex_v1 and myindex_v2 now contain the same documents; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 4: A/B testing of the new index
[Diagram: readers still query via the alias myindex while both indices receive double writes, so results from myindex_v2 can be compared; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 4: swap read alias (or swap back!)
[Diagram: the alias myindex is repointed to myindex_v2; double writing continues, so the swap can be reverted at any time; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 5: switch write index, generation, stop double writing
[Diagram: writes at generation 45 go only to myindex_v2, and myindex_v1 can be retired; state: generation 45, extra <null>, current myindex_v2.]
Handling Failures
• Bulk re-indexing can fail (and it does); you don't want to restart from scratch
• Use a "partition" field
• Migrate partition ranges
• Deletions could be a problem. We handle them by writing "deletion markers" instead, then cleaning up afterward
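Partition-range migration with deletion markers might look like the sketch below. The document shape (a `partition` field, the `_generation` field from the earlier slides, and a `deleted` marker flag) is an assumption for illustration:

```python
def migrate_partition(docs, lo, hi, max_generation):
    """Select the documents of one partition range that still need
    re-indexing. Deletion markers are skipped rather than copied;
    they get cleaned up later. Failing mid-migration only loses one
    partition range, not the whole job. Document shape is hypothetical."""
    return [
        d for d in docs
        if lo <= d["partition"] < hi           # this range only
        and d["_generation"] <= max_generation  # older generations only
        and not d.get("deleted", False)         # skip deletion markers
    ]

docs = [
    {"partition": 1, "_generation": 42},                   # copy this one
    {"partition": 1, "_generation": 43},                   # newer gen: later pass
    {"partition": 1, "_generation": 42, "deleted": True},  # marker: skip
    {"partition": 9, "_generation": 42},                   # other range
]
to_copy = migrate_partition(docs, 0, 5, 42)
```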
Performance Considerations
• Migrate using partition ranges to avoid holding segments
for a long time
• Add temporary nodes to handle the load
• Disable refreshes on the target index (so worth it!)
• Start with no replicas (or one, just in case)
• Avoid "hot" shards by sorting on a field (a timestamp, for example)
• Have throttling controls to limit indexing load
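The refresh and replica advice translates into index settings roughly like these (`refresh_interval` and `number_of_replicas` are real Elasticsearch settings; the helper names and restore values are illustrative):

```python
def bulk_target_settings():
    """Settings for the target index during bulk re-indexing, per the
    slides: refreshes off and no replicas until the copy is done."""
    return {"index": {"refresh_interval": "-1",   # -1 disables refresh
                      "number_of_replicas": 0}}

def restore_settings(replicas=2):
    """Settings to apply once migration completes; the 1s interval and
    default replica count here are illustrative choices."""
    return {"index": {"refresh_interval": "1s",
                      "number_of_replicas": replicas}}
```

Each dict would be sent as the body of `PUT myindex_v2/_settings` before and after the migration, respectively.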
Thank you!
Sign up for a free trial at
signalfx.com
Remember to complete
your evaluations!