AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Mahdi Ben Hamida - SignalFx
11/30/2016
DEV307
How to Scale and Operate Elasticsearch on AWS
What to Expect from the Session
• Elasticsearch (ES) usage at SignalFx
• What do we use ES for?
• How ES is deployed on AWS
• Backup/restore of ES on Amazon S3
• Important ES/AWS metrics to monitor; what to alert on
• ES capacity planning
• Zero-downtime re-sharding
• SignalFx metadata storage architecture overview
• Scaling up and zero-downtime re-sharding on AWS
Elasticsearch at SignalFx
ES Usage
Ad-hoc queries Auto-complete Full-text search
Cluster Size
• 4 clusters in production on Amazon EC2
• Biggest cluster
• 54 data nodes, 3 master nodes, 6 client nodes deployed
across 3 AZs
• Over 1.3 billion unique documents
• 10+ TB of data
• 270 shards (primaries + replicas)
• Sustained 75 QPS, 1K index/sec
ES Deployment on AWS
• Dockerized ES 2.3/1.7 clusters. Orchestration done
using MaestroNG
• Biggest cluster
• Data nodes: i2.2xlarge – 16 GB heap (61 GB total)
• Master nodes: m3.large – 2 GB heap (7.5 GB total)
• Client nodes: m3.xlarge – 10 GB heap (15 GB total)
• ES rack awareness to distribute the primary and 2 replicas across 3 Availability Zones
Backup/Restore
• Made easy using the AWS Cloud plugin:

  PUT _snapshot/s3-repo
  {
    "type": "s3",
    "settings": {
      "bucket": "signalfx-es-backups",
      "region": "us-east"
    }
  }
• Incremental backups
• Un-versioned S3 bucket
• VPC S3 endpoint to avoid bandwidth constraints
• Instance profiles for authentication to S3
• Cron job for hourly snapshots and weekly rotation
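The hourly snapshot cron job can be sketched as a small helper that builds the snapshot request. The date-based naming scheme here is a hypothetical choice (the talk does not specify one), but it makes the weekly rotation job easy to implement:

```python
import datetime
import json

def snapshot_request(repo="s3-repo", now=None):
    """Build the REST method, path, and body for an hourly snapshot.

    The "hourly-YYYYMMDD-HH00" naming scheme is an assumption for
    illustration; date-based names let a rotation job find and delete
    snapshots older than a week.
    """
    now = now or datetime.datetime.utcnow()
    name = now.strftime("hourly-%Y%m%d-%H00")
    path = "/_snapshot/{}/{}".format(repo, name)
    # Snapshot all indices; skip any that are temporarily unavailable.
    body = json.dumps({"indices": "_all", "ignore_unavailable": True})
    return "PUT", path, body

method, path, body = snapshot_request(now=datetime.datetime(2016, 11, 30, 9))
```

Because S3 snapshots are incremental, the hourly job only uploads segments that changed since the previous snapshot.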
ES Monitoring & Alerting
Key Performance Metrics
Key Detectors
• High CPU usage, low free disk space
• Sustained high heap usage
• Master nodes availability
• Cluster state (green/yellow/red)
• Unassigned shards
• Thread pool rejections (search, bulk, index are the most
critical)
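A "sustained high heap" detector needs to ignore short GC spikes and fire only when usage stays high. A minimal sketch of that logic (thresholds are illustrative, not SignalFx's actual settings):

```python
def sustained_high(samples, threshold=0.9, min_consecutive=5):
    """Fire only when `min_consecutive` samples in a row exceed
    `threshold`, so a single GC spike doesn't page anyone.
    Threshold and window are hypothetical example values."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False

# A brief spike does not fire; a sustained run does.
spike = [0.95, 0.95, 0.70, 0.95, 0.70, 0.95]
sustained = [0.92] * 5
```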
Always Test your ES Detectors/Alerts
Elasticsearch Capacity Planning
Capacity Factors
• Indexing
• CPU/IO utilization can be considerable
• Merges are CPU/IO intensive. Improved in ES 2.0
• Queries
• CPU load
• Memory load
ES Sharding & Scale-up
[Diagram: an index with primary shards 0P and 1P and replicas 0R and 1R, first packed onto node-1/node-2; adding node-3/node-4 gives the replicas their own nodes; adding node-5/node-6 hosts a second set of replicas 0R and 1R.]
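The reason scale-up eventually requires re-sharding is that Elasticsearch routes each document to `hash(routing) % number_of_primary_shards`, so the primary count is fixed at index creation; adding nodes only spreads the existing shards around. A sketch of that routing rule (using Python's `zlib.crc32` in place of the Murmur3 hash ES actually uses):

```python
import zlib

def shard_for(doc_id, num_primary_shards):
    # ES routes a document to hash(routing) % number_of_primary_shards.
    # crc32 stands in for ES's Murmur3 here, purely for illustration.
    return zlib.crc32(doc_id.encode()) % num_primary_shards

# Changing the primary count reassigns many documents, which is why
# growing from 2 to 4 primaries means re-indexing into a new index.
two = [shard_for("doc-%d" % i, 2) for i in range(100)]
four = [shard_for("doc-%d" % i, 4) for i in range(100)]
```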
Sizing Shards
• Create an index with one shard
• Simulate what you expect your indexing load to be –
measure CPU/IO load, find where it breaks
• Do the same with queries
• Determine disk consumption (average document size)
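The single-shard benchmark numbers can then be turned into a shard count. `shard_count` is a hypothetical helper combining the two limits you measured, taking whichever dimension demands more shards:

```python
import math

def shard_count(expected_index_rate, per_shard_index_limit,
                expected_data_gb, per_shard_disk_gb):
    """Derive a shard count from one-shard benchmark results.

    `per_shard_index_limit` is the docs/sec where the single shard
    broke in your test; `per_shard_disk_gb` is how much data you are
    willing to keep on one shard. Both inputs come from your own
    measurements; the function itself is an illustrative sketch.
    """
    by_rate = math.ceil(expected_index_rate / per_shard_index_limit)
    by_disk = math.ceil(expected_data_gb / per_shard_disk_gb)
    return max(by_rate, by_disk)
```

For example, 1,000 docs/sec against a measured 400 docs/sec per-shard limit needs 3 shards, but 10 TB at 500 GB per shard needs 20, so disk wins.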
Zero-downtime Re-sharding
Why Re-shard?
• Required if you can’t scale up indexing by adding more
nodes
• If the index is read-only, you could implement a simpler
approach using aliases
• If the index is being written to, it’s more complicated
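For the read-only case, the alias approach amounts to building the new index offline and then atomically repointing readers with the `_aliases` API. A sketch of that request body (the index names follow the `myindex_v1`/`myindex_v2` convention used later in the talk):

```python
def alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases: both actions execute in a single
    cluster-state update, so readers never observe an empty alias.
    Index/alias names here are illustrative."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_swap("myindex", "myindex_v1", "myindex_v2")
```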
service-A
metabase-client
mb-
server-1mb-
server-1metabase-1index-topic
write-topic
(1) enqueue write
(2) dequeue write
(3) write to C*
(4) enqueue index
(7) index document
(5) dequeue index
(6) read from C*
SignalFx’s Metadata Storage Architecture
Index Re-sharding Process
• Pre-requisites
• Phase 1: create target index
• Phase 2: bulk re-indexing
• Phase 3: double writing & change re-conciliation
• Phase 4: testing new index
• Phase 5: complete re-sharding process
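The phases above can be condensed into a sketch of the indexer's state machine, mirroring the generation numbers shown on the following slides (a hypothetical in-memory class; the real indexer persists this state):

```python
class IndexerState:
    """Tracks the three fields from the slides: a generation number,
    the current write index, and an optional extra index used while
    double writing. Method names are illustrative."""

    def __init__(self):
        self.generation = 42
        self.current = "myindex_v1"
        self.extra = None

    def targets(self):
        # Every write goes to current, and to extra while double writing.
        return [i for i in (self.current, self.extra) if i]

    def start_bulk_reindex(self):        # Phase 2
        self.generation += 1             # older generations get bulk-copied

    def enable_double_write(self, new):  # Phase 3 (a)
        self.extra = new
        self.generation += 1

    def complete(self, ):                # Phase 5
        self.current, self.extra = self.extra, None
        self.generation += 1

state = IndexerState()
state.start_bulk_reindex()               # generation 43
state.enable_double_write("myindex_v2")  # generation 44, writes go to both
state.complete()                         # generation 45, myindex_v2 only
```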
Pre-requisite 1: readers query from an alias
[Diagram: three readers query the alias myindex, which points to myindex_v1.]
Pre-requisite 2: indexing state + generation number
[Diagram: the indexer writes to myindex_v1 and tracks state: generation 42, extra <null>, current myindex_v1.]
Phase 1: create new index with updated mappings
[Diagram: myindex_v2 is created alongside myindex_v1; indexer state unchanged: generation 42, extra <null>, current myindex_v1.]
Phase 2: increment generation, then start bulk re-indexing of older generations
[Diagram: documents with _generation <= 42 are bulk-copied from myindex_v1 into myindex_v2; indexer state: generation 43, extra <null>, current myindex_v1.]
During this step, documents may get added/updated (or deleted*)
[Diagram: while the bulk copy of _generation <= 42 documents runs, newly created or updated documents land in myindex_v1 at generation 43; indexer state: generation 43, extra <null>, current myindex_v1.]
Index state at the end of the bulk indexing
[Diagram: myindex_v2 holds a copy of the older documents, but the generation-43 writes exist only in myindex_v1; indexer state: generation 43, extra <null>, current myindex_v1.]
Phase 3 – (a): enable double writing & bump generation
[Diagram: the indexer now writes to both myindex_v1 and myindex_v2; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 3 – (b): re-index documents at generation 43
[Diagram: documents written at generation 43 during the bulk copy are re-indexed into myindex_v2, while new generation-44 writes go to both indices; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 3 – (c): re-index documents at generation 43 (continued)
[Diagram, animation frames: the generation-43 backlog shrinks as it is copied into myindex_v2, while generation-44 double writes accumulate in both indices; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 3 – (e): perfect sync of both indices
[Diagram: myindex_v1 and myindex_v2 now contain the same documents; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 4: A/B testing of the new index
[Diagram: readers still query via the alias myindex while both indices receive double writes, so results from myindex_v2 can be compared; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 4: swap read alias (or swap back!)
[Diagram: the alias myindex is repointed to myindex_v2; double writing continues, so the swap can be reverted at any time; state: generation 44, extra myindex_v2, current myindex_v1.]
Phase 5: switch write index, generation, stop double writing
[Diagram: writes at generation 45 go only to myindex_v2, and myindex_v1 can be retired; state: generation 45, extra <null>, current myindex_v2.]
Handling Failures
• Bulk re-indexing can fail (and it does); you don't want to restart from scratch
• Use a "partition" field
• Migrate partition ranges
• Deletions could be a problem. We handle them by writing "deletion markers" instead, then cleaning up afterward
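Partition-range migration with deletion markers might look like the sketch below. The document shape (a `partition` field, the `_generation` field from the earlier slides, and a `deleted` marker flag) is an assumption for illustration:

```python
def migrate_partition(docs, lo, hi, max_generation):
    """Select the documents of one partition range that still need
    re-indexing. Deletion markers are skipped rather than copied;
    they get cleaned up later. Failing mid-migration only loses one
    partition range, not the whole job. Document shape is hypothetical."""
    return [
        d for d in docs
        if lo <= d["partition"] < hi           # this range only
        and d["_generation"] <= max_generation  # older generations only
        and not d.get("deleted", False)         # skip deletion markers
    ]

docs = [
    {"partition": 1, "_generation": 42},                   # copy this one
    {"partition": 1, "_generation": 43},                   # newer gen: later pass
    {"partition": 1, "_generation": 42, "deleted": True},  # marker: skip
    {"partition": 9, "_generation": 42},                   # other range
]
to_copy = migrate_partition(docs, 0, 5, 42)
```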
Performance Considerations
• Migrate using partition ranges to avoid holding segments
for a long time
• Add temporary nodes to handle the load
• Disable refreshes on the target index (so worth it!)
• Start with no replicas (or one, just in case)
• Avoid "hot" shards by sorting on a field (a timestamp, for example)
• Have throttling controls to limit indexing load
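The refresh and replica advice translates into index settings roughly like these (`refresh_interval` and `number_of_replicas` are real Elasticsearch settings; the helper names and restore values are illustrative):

```python
def bulk_target_settings():
    """Settings for the target index during bulk re-indexing, per the
    slides: refreshes off and no replicas until the copy is done."""
    return {"index": {"refresh_interval": "-1",   # -1 disables refresh
                      "number_of_replicas": 0}}

def restore_settings(replicas=2):
    """Settings to apply once migration completes; the 1s interval and
    default replica count here are illustrative choices."""
    return {"index": {"refresh_interval": "1s",
                      "number_of_replicas": replicas}}
```

Each dict would be sent as the body of `PUT myindex_v2/_settings` before and after the migration, respectively.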
Thank you!
Sign up for a free trial at
signalfx.com
Remember to complete
your evaluations!