AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Mahdi Ben Hamida - SignalFx 11/30/2016 DEV307 How to Scale and Operate Elasticsearch on AWS


Page 1: AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Mahdi Ben Hamida - SignalFx

11/30/2016

DEV307

How to Scale and Operate Elasticsearch on AWS

Page 2

What to Expect from the Session

• Elasticsearch (ES) usage at SignalFx

• What do we use ES for?

• How is ES deployed on AWS?

• Backup/restore of ES on Amazon S3

• Important ES/AWS metrics to monitor; what to alert on

• ES capacity planning

• Zero-downtime re-sharding

• SignalFx metadata storage architecture overview

• Scaling up and zero-downtime re-sharding on AWS

Page 3

Elasticsearch at SignalFx

Page 4

ES Usage

• Ad-hoc queries

• Auto-complete

• Full-text search

Page 5

Cluster Size

• 4 clusters in production on Amazon EC2

• Biggest cluster

  • 54 data nodes, 3 master nodes, 6 client nodes deployed across 3 AZs

  • Over 1.3 billion unique documents

  • 10+ TB of data

  • 270 shards (primaries + replicas)

  • Sustained 75 QPS, 1K index/sec

Page 6

ES Deployment on AWS

• Dockerized ES 2.3/1.7 clusters. Orchestration done using MaestroNG

• Biggest cluster

  • Data nodes: i2.2xlarge – 16 GB heap (61 GB total)

  • Master nodes: m3.large – 2 GB heap (7.5 GB total)

  • Client nodes: m3.xlarge – 10 GB heap (15 GB total)

• ES rack awareness to distribute the primary and 2 replicas across 3 Availability Zones
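As a rough cross-check of the figures quoted for the biggest cluster, the arithmetic below derives the number of primaries and the per-node and per-zone shard counts. It is only a sketch: it assumes that every index in the cluster uses one primary plus two replicas and that shards are spread evenly, neither of which the slides state outright. The function name is made up for illustration.

```python
# Figures quoted on the slides for the biggest cluster.
TOTAL_SHARDS = 270        # primaries + replicas
COPIES_PER_SHARD = 3      # 1 primary + 2 replicas (assumed to hold for every index)
DATA_NODES = 54
AVAILABILITY_ZONES = 3


def cluster_layout(total_shards, copies, data_nodes, zones):
    """Derive per-node and per-zone figures, assuming an even spread."""
    return {
        "primaries": total_shards // copies,
        "shards_per_node": total_shards / data_nodes,
        "nodes_per_zone": data_nodes // zones,
        "shards_per_zone": total_shards // zones,
    }


layout = cluster_layout(TOTAL_SHARDS, COPIES_PER_SHARD, DATA_NODES, AVAILABILITY_ZONES)
print(layout)
```

Under these assumptions each Availability Zone holds one full copy of the data, which is what rack awareness with three copies across three zones aims for.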

Page 7

Backup/Restore

• Made easy using the AWS Cloud plugin:

    PUT _snapshot/s3-repo
    {
      "type": "s3",
      "settings": {
        "bucket": "signalfx-es-backups",
        "region": "us-east"
      }
    }

• Incremental backups

• Un-versioned S3 bucket

• VPC S3 endpoint to avoid bandwidth constraints

• Instance profiles for authentication to S3

• Cron job for hourly snapshots and weekly rotation
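The hourly snapshot and weekly rotation job could look roughly like the sketch below. It only builds the HTTP method and path for each call to the snapshot API and does not send anything. The snapshot naming scheme, the 7-day reading of "weekly rotation", and all function names are assumptions made for illustration; the slides do not show the actual cron job.

```python
from datetime import datetime, timedelta

REPO = "s3-repo"                   # repository name from the slide
RETENTION = timedelta(days=7)      # assumption: "weekly rotation" means keep 7 days
NAME_FORMAT = "hourly-%Y%m%d-%H"   # hypothetical snapshot naming scheme


def snapshot_name(now):
    return now.strftime(NAME_FORMAT)


def create_request(now):
    """HTTP method and path that create this hour's snapshot."""
    return ("PUT", "/_snapshot/%s/%s" % (REPO, snapshot_name(now)))


def expired(names, now):
    """Snapshot names whose timestamp falls before the retention window."""
    cutoff = now - RETENTION
    return [n for n in names if datetime.strptime(n, NAME_FORMAT) < cutoff]


def delete_requests(names, now):
    return [("DELETE", "/_snapshot/%s/%s" % (REPO, n)) for n in expired(names, now)]


now = datetime(2016, 11, 30, 12)
existing = [
    "hourly-20161122-12",
    "hourly-20161123-11",
    "hourly-20161123-13",
    "hourly-20161130-11",
]
print(create_request(now))
print(delete_requests(existing, now))
```

Encoding the timestamp in the snapshot name keeps the rotation logic independent of any extra bookkeeping: the job can list the repository and decide what to delete from the names alone.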

Page 8

ES Monitoring & Alerting

Page 9

Key Performance Metrics

Page 10

Key Detectors

• High CPU usage, low free disk space

• Sustained high heap usage

• Master node availability

• Cluster state (green/yellow/red)

• Unassigned shards

• Thread pool rejections (search, bulk, and index are the most critical)
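A few of these detectors can be expressed as simple rules over periodically polled stats, as in the sketch below. The flat sample layout, the 85% heap threshold, and the three-sample window are assumptions chosen for illustration, not values from the talk. The key names mirror fields that ES reports (heap_used_percent, status, unassigned_shards, rejected), but how they are collected is left out.

```python
def sustained(values, threshold, min_samples):
    """True if the last `min_samples` readings all exceed `threshold`."""
    recent = values[-min_samples:]
    return len(recent) == min_samples and all(v > threshold for v in recent)


def evaluate_detectors(samples):
    """samples: list of dicts, oldest first, one per polling interval."""
    latest, previous = samples[-1], samples[-2]
    alerts = []
    if sustained([s["heap_used_percent"] for s in samples], 85, 3):
        alerts.append("sustained high heap usage")
    if latest["status"] != "green":
        alerts.append("cluster state is " + latest["status"])
    if latest["unassigned_shards"] > 0:
        alerts.append("unassigned shards")
    for pool in ("search", "bulk", "index"):
        # Rejection counters are cumulative, so alert on the increase.
        if latest["rejected"][pool] > previous["rejected"][pool]:
            alerts.append("thread pool rejections: " + pool)
    return alerts


samples = [
    {"heap_used_percent": 88, "status": "green", "unassigned_shards": 0,
     "rejected": {"search": 0, "bulk": 4, "index": 0}},
    {"heap_used_percent": 90, "status": "green", "unassigned_shards": 0,
     "rejected": {"search": 0, "bulk": 4, "index": 0}},
    {"heap_used_percent": 91, "status": "yellow", "unassigned_shards": 2,
     "rejected": {"search": 0, "bulk": 9, "index": 0}},
]
print(evaluate_detectors(samples))
```

Requiring several consecutive samples for the heap rule is one way to separate sustained pressure from the normal sawtooth of garbage collection.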

Page 11

Always Test your ES Detectors/Alerts

Page 12

Elasticsearch Capacity Planning

Page 13

Capacity Factors

• Indexing

  • CPU/IO utilization can be considerable

  • Merges are CPU/IO intensive. Improved in ES 2.0

• Queries

  • CPU load

  • Memory load

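Page 14

ES Sharding & Scale-up

[Diagram: an index with 2 primary shards (0P, 1P) and their replicas (0R, 1R) as the cluster grows]

• 2 nodes: node-1 holds 1P and 0R; node-2 holds 0P and 1R

• 4 nodes: node-1 holds 1P; node-2 holds 0P; node-3 holds 0R; node-4 holds 1R

• 6 nodes: a second replica of each shard is added on node-5 (0R) and node-6 (1R)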

Page 15

Sizing Shards

• Create an index with one shard

• Simulate what you expect your indexing load to be – measure CPU/IO load, find where it breaks

• Do the same with queries

• Determine disk consumption (average document size)
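Once the single-shard measurements exist, they can be turned into a first estimate of the number of primary shards. The sketch below is one way to do that arithmetic. Every per-shard limit in it (150 docs/sec, 50 GB per shard, 2 KB average document) is a made-up placeholder, and the function name is hypothetical. Query load is deliberately left out because it depends on fan-out and replica count rather than on primaries alone.

```python
import math


def primaries_needed(target_index_rate, shard_index_rate,
                     total_docs, avg_doc_bytes, shard_max_bytes):
    """Primaries required by indexing throughput and by disk, whichever is larger."""
    by_indexing = math.ceil(target_index_rate / shard_index_rate)
    by_disk = math.ceil(total_docs * avg_doc_bytes / shard_max_bytes)
    return max(by_indexing, by_disk)


# Placeholder measurements from a hypothetical single-shard test.
estimate = primaries_needed(
    target_index_rate=1000,          # docs/sec the cluster must sustain
    shard_index_rate=150,            # docs/sec where one shard "breaks"
    total_docs=1300000000,           # expected document count
    avg_doc_bytes=2048,              # measured average document size on disk
    shard_max_bytes=50 * 1024 ** 3,  # largest shard size judged acceptable
)
print(estimate)
```

With these placeholder numbers disk is the binding constraint; with a lower per-shard indexing limit, throughput would dominate instead.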

Page 16

Zero-downtime Re-sharding

Page 17

Why Re-shard?

• Required if you can’t scale up indexing by adding more nodes

• If the index is read-only, you could implement a simpler approach using aliases

• If the index is being written to, it’s more complicated

Page 18

SignalFx’s Metadata Storage Architecture

[Diagram: service-A, metabase-client, mb-server-1, metabase-1, write-topic, and index-topic, connected by the numbered steps below]

(1) enqueue write

(2) dequeue write

(3) write to C*

(4) enqueue index

(5) dequeue index

(6) read from C*

(7) index document

Page 19

Index Re-sharding Process

• Pre-requisites

• Phase 1: create target index

• Phase 2: bulk re-indexing

• Phase 3: double writing & change reconciliation

• Phase 4: testing new index

• Phase 5: complete re-sharding process

Page 20

Pre-requisite 1: readers query from an alias

[Diagram: three readers query the alias myindex, which points to myindex_v1]

Page 21

Pre-requisite 2: indexing state + generation number

[Diagram: the indexer writes to myindex_v1]

Indexer state: generation: 42, extra: <null>, current: myindex_v1

Page 22

Phase 1: create new index with updated mappings

[Diagram: myindex_v2 is created alongside myindex_v1]

Indexer state: generation: 42, extra: <null>, current: myindex_v1

Page 23

Phase 2: increment generation, then start bulk re-indexing of older generations

[Diagram: documents with _generation <= 42 are copied from myindex_v1 to myindex_v2]

Indexer state: generation: 43, extra: <null>, current: myindex_v1
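Selecting the "older generations" amounts to a range filter on the generation field. The sketch below builds such a search body with the standard range query. The helper name is hypothetical and the field name _generation is taken from the slide; the slides do not show the query that was actually used.

```python
def older_generations_query(cutoff):
    """Search body matching documents stamped at or before `cutoff`."""
    return {"query": {"range": {"_generation": {"lte": cutoff}}}}


# The generation was bumped from 42 to 43 before the bulk copy starts,
# so everything written from now on is excluded from this pass.
body = older_generations_query(42)
print(body)
```

Because new writes are already stamped with generation 43, the set matched by this query can only shrink while the copy runs (when a document is updated), which gives the bulk pass a well-defined end.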

Page 24

During this step, documents may get added/updated (or deleted*)

[Diagram: while documents with _generation <= 42 are copied to myindex_v2, documents created or updated in myindex_v1 are stamped with generation 43]

Indexer state: generation: 43, extra: <null>, current: myindex_v1

Page 25

Index state at the end of the bulk indexing

[Diagram: myindex_v1 and myindex_v2; several documents in myindex_v1 are stamped with generation 43]

Indexer state: generation: 43, extra: <null>, current: myindex_v1

Page 26

Phase 3 – (a): enable double writing & bump generation

[Diagram: the indexer now writes to both myindex_v1 and myindex_v2]

Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1

Page 27

Phase 3 – (b): re-index documents at generation 43

[Diagram: new documents stamped with generation 44 appear in both indices while documents at generation 43 are re-indexed]

Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1

Page 28

Phase 3 – (c): re-index documents at generation 43

[Diagram, next animation frame: more documents at generation 44 in both indices]

Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1

Page 29

Phase 3 – (c): re-index documents at generation 43

[Diagram, next animation frame: more documents at generation 44 in both indices]

Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1

Page 30

Phase 3 – (c): re-index documents at generation 43

[Diagram, next animation frame: more documents at generation 44 in both indices]

Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1

Page 31

Phase 3 – (e): perfect sync of both indices

[Diagram: myindex_v1 and myindex_v2 hold the same documents]

Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1
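The way the two generation bumps lead to both indices being in sync can be traced with a small in-memory simulation. This is only a sketch of the idea: plain dictionaries stand in for the two indices, the helper names are invented, and the real system works through queues and Cassandra rather than direct writes.

```python
def write(state, indices, doc_id, body):
    """Stamp the document with the current generation and write it to the
    current index, plus the extra index while double writing is on."""
    doc = dict(body, _generation=state["generation"])
    for name in (state["current"], state["extra"]):
        if name is not None:
            indices[name][doc_id] = doc


def copy_generations(indices, src, dst, low, high):
    """Copy every document in src whose generation lies in [low, high]."""
    for doc_id, doc in indices[src].items():
        if low <= doc["_generation"] <= high:
            indices[dst][doc_id] = doc


indices = {"myindex_v1": {}, "myindex_v2": {}}
state = {"generation": 42, "extra": None, "current": "myindex_v1"}
write(state, indices, "a", {"v": 1})
write(state, indices, "b", {"v": 1})

# Phase 2: bump the generation, then bulk re-index generations <= 42.
state["generation"] = 43
write(state, indices, "b", {"v": 2})   # updated while the bulk copy runs
write(state, indices, "c", {"v": 1})   # created while the bulk copy runs
copy_generations(indices, "myindex_v1", "myindex_v2", 0, 42)
missing = sorted(set(indices["myindex_v1"]) - set(indices["myindex_v2"]))

# Phase 3 (a): enable double writing and bump the generation again.
state.update(generation=44, extra="myindex_v2")
write(state, indices, "d", {"v": 1})   # lands in both indices

# Phase 3 (b) onwards: reconcile by re-indexing generation 43 only.
copy_generations(indices, "myindex_v1", "myindex_v2", 43, 43)
in_sync = indices["myindex_v1"] == indices["myindex_v2"]
print(missing, in_sync)
```

The second bump matters because anything stamped 44 is known to have been written to both indices, so the reconciliation pass only has to look at generation 43.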

Page 32

Phase 4: A/B testing of the new index

[Diagram: readers query through the alias myindex while the indexer keeps writing to both indices]

Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1

Page 33

Phase 4: swap read alias (or swap back!)

[Diagram: the alias myindex used by the readers is switched between myindex_v1 and myindex_v2 while the indexer keeps writing to both indices]

Indexer state: generation: 44, extra: myindex_v2, current: myindex_v1
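Swapping the read alias is a single call to the index aliases API, where a remove and an add submitted in one request are applied atomically. The sketch below only builds the request body for POST /_aliases; the helper name is hypothetical.

```python
def swap_alias_actions(alias, old_index, new_index):
    """Body for POST /_aliases that repoints `alias` in one atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = swap_alias_actions("myindex", "myindex_v1", "myindex_v2")
swap_back = swap_alias_actions("myindex", "myindex_v2", "myindex_v1")
print(swap)
```

Since double writing is still on at this point, both indices stay current, so swapping back is the same call with the index names reversed.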

Page 34

Phase 5: switch write index, bump generation, stop double writing

[Diagram: the indexer writes only to myindex_v2; new documents are stamped with generation 45]

Indexer state: generation: 45, extra: <null>, current: myindex_v2
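The indexer state shown on the slides goes through four values over the whole process. The sketch below models those transitions as pure functions so the sequence can be checked against the slides. The type and function names are invented for illustration; only the three fields and their values come from the talk.

```python
from typing import NamedTuple, Optional


class IndexerState(NamedTuple):
    generation: int
    extra: Optional[str]
    current: str


def start_bulk_reindex(state):
    """Phase 2: bump the generation before the bulk copy starts."""
    return state._replace(generation=state.generation + 1)


def enable_double_writing(state, target):
    """Phase 3 (a): also write to `target` and bump the generation."""
    return state._replace(generation=state.generation + 1, extra=target)


def complete(state):
    """Phase 5: the extra index becomes current; double writing stops."""
    if state.extra is None:
        raise ValueError("double writing is not enabled")
    return IndexerState(state.generation + 1, None, state.extra)


s0 = IndexerState(42, None, "myindex_v1")
s1 = start_bulk_reindex(s0)
s2 = enable_double_writing(s1, "myindex_v2")
s3 = complete(s2)
print(s0, s1, s2, s3, sep="\n")
```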

Page 35

Handling Failures

• Bulk re-indexing can fail (and it does); you don’t want to restart from scratch

  • Use a “partition” field

  • Migrate partition ranges

• Deletions could be a problem. We handle that by using “deletion markers” instead, then cleaning up
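Migrating by partition range makes the bulk re-indexing resumable: a failure costs at most one range. The sketch below shows the idea with an in-memory checkpoint set and a copy function that fails once. The partition count, range size, and all names are assumptions for illustration; how SignalFx assigns partitions or stores progress is not described in the slides.

```python
def partition_ranges(num_partitions, step):
    """Split [0, num_partitions) into half-open ranges of `step` partitions."""
    return [(lo, min(lo + step, num_partitions))
            for lo in range(0, num_partitions, step)]


def migrate(ranges, copy_range, done):
    """Copy each range not yet recorded in `done`; record each success."""
    for r in ranges:
        if r in done:
            continue
        copy_range(r)   # may raise; ranges already in `done` are kept
        done.add(r)


copied = []
failures = {(4, 8)}     # this range fails on the first attempt only


def flaky_copy(r):
    if r in failures:
        failures.discard(r)
        raise RuntimeError("bulk request failed for %s" % (r,))
    copied.append(r)


ranges = partition_ranges(10, 4)
done = set()
try:
    migrate(ranges, flaky_copy, done)
except RuntimeError:
    pass
migrate(ranges, flaky_copy, done)   # resumes after the last completed range
print(copied)
```

In a real run the `done` set would need to be persisted somewhere durable so that progress survives a restart of the migration process.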

Page 36

Performance Considerations

• Migrate using partition ranges to avoid holding segments for a long time

• Add temporary nodes to handle the load

• Disable refreshes on the target index (so worth it!)

• Start with no replica (or one just in case)

• Avoid “hot” shards by sorting on a field (a timestamp for example)

• Use throttling controls to limit indexing load
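Disabling refreshes and starting without replicas are both dynamic index settings, so they can be applied to the target index before the bulk load and reverted afterwards with a settings update. The sketch below builds the two settings bodies and a very simple pacing calculation for throttling. The helper names, the 1s refresh interval, and the batch numbers are assumptions; the talk does not say which values were used.

```python
def bulk_load_settings():
    """Settings for the target index during bulk re-indexing."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def serving_settings(replicas, refresh_interval="1s"):
    """Settings to restore once the bulk copy is done."""
    return {"index": {"refresh_interval": refresh_interval,
                      "number_of_replicas": replicas}}


def batch_delay(batch_size, max_docs_per_sec):
    """Seconds to wait between bulk requests to stay under a rate cap."""
    return batch_size / max_docs_per_sec


before = bulk_load_settings()
after = serving_settings(replicas=2)   # primary + 2 replicas, as on the deployment slide
delay = batch_delay(batch_size=500, max_docs_per_sec=2000)
print(before, after, delay)
```

Sleeping for a fixed delay between batches is the crudest form of throttling; it is shown here only to make the rate cap concrete.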

Page 37

Thank you!

Sign up for a free trial at signalfx.com

Page 38

Remember to complete your evaluations!