Go Big Quick with Elasticsearch

Go Big Quick

Jason Scheller, Platform & Content Analytics, Eikon

Pricing & Text Analytics Platform

• Mission - Ingest, enrich, store, analyze everything. Provide a single platform for search and analytics capabilities over any hosted content. Serve as a platform for future innovation.

• Content

• Twitter (~675 Tweets/sec, 15 days history)

• News (~40 articles/sec, 18 months history)

• Research (40 million docs, 3 million/year)

• Filings (29 million docs, 2.5 million/year)

• Trade data (500k RICS, 30K/sec, 10 years)

• Various metadata and derived content sets

Pricing & Text Analytics Platform

Infrastructure

IBM Streams: 30 servers

18 servers, 86 TB

Where to start?

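[Diagram: data is loaded into an index with a single shard (Shard 0) while JMeter drives request load against it, to find the maximum shard size]

Max Shard

• Disk space

• Request load

• RAM usage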

Maximum Shard Size

• This same experiment will also give you the ratio of data to index size, which is great for planning. Just make sure you’re using your real analyzer settings.

• The rest is just math!

• Don’t forget to account for:

• Memory required to facet & sort

• Replica shards

• Data compression

Max Total Index Size / Max Shard Size = # Nodes
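The "just math" step can be sketched in a few lines. This is a minimal illustration, not the speaker's spreadsheet: the function name, parameters, and all numbers are hypothetical, and it only covers the data-to-index ratio and replica shards from the list above. The slide's formula yields a node count if each node holds one maximum-size shard; the sketch reports shard counts, which match node counts under that assumption.

```python
import math

def shards_needed(raw_data_gb, index_to_data_ratio, max_shard_gb, replicas=0):
    """Estimate primary and total shard counts (hypothetical helper).

    raw_data_gb         -- size of the source data
    index_to_data_ratio -- index size / data size, measured in the
                           single-shard experiment with real analyzers
    max_shard_gb        -- maximum shard size found in that experiment
    replicas            -- number of replica copies per primary shard
    """
    total_index_gb = raw_data_gb * index_to_data_ratio
    primaries = math.ceil(total_index_gb / max_shard_gb)
    # Every replica is a full copy of its primary shard.
    total = primaries * (1 + replicas)
    return primaries, total

# Made-up numbers: 10 TB of data, index 1.5x the data size, 50 GB max shard.
primaries, total = shards_needed(10_000, 1.5, 50, replicas=1)
print(primaries, total)  # 300 600
```

Memory for faceting and sorting is not modeled here; it would lower the usable maximum shard size rather than change the division.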

SPREADSHEET

But do I always use Max Shards?

ALLOCATION & HARDWARE

Cluster Allocation

• Elasticsearch will figure out which node should host which shard. Let it! It’s better than you at figuring this out and moving shards around.

• Well, mostly…

• Let’s say you have indices A – D, 4 shards each, 0 replicas, 4 nodes. Elasticsearch might arrange your shards like this based on the size of each shard.

[Diagram: the 16 shards A1 to D4 spread across four nodes]

Cluster Allocation

• But what about other considerations?

• Hot spotting

• Access frequency

• Connectivity for River-based ingestion

• Heterogeneous hardware

[Diagram: the 16 shards A1 to D4 spread across four nodes]

Cluster Allocation – Heterogeneous Hardware

• Suppose you know that indices A and B get queried 1000s of times per second, but C and D are only hit about once a second. Maybe you bought some better hardware to host A and B and don’t want to waste those machines on C and D.

• Is this a good allocation?

[Diagram: four nodes labeled Slow HW, Slow HW, Fast HW, Fast HW, with the shards of A, B, C and D mixed across all four]

• Not really. The slower machines will slow all queries to A & B. And I’m not getting my money’s worth from that better hardware!
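Why the slow machines hold back A and B can be seen with a deliberately simplified model: a search fans out to every shard of the index and cannot return until the slowest shard has answered. The per-shard latencies below are invented for illustration, and the model ignores coordination overhead and queueing.

```python
def query_latency_ms(shard_latencies_ms):
    """Simplified model: a query is as slow as its slowest shard."""
    return max(shard_latencies_ms)

# Hypothetical per-shard response times for index A (4 shards).
SLOW_MS, FAST_MS = 80, 20

mixed = query_latency_ms([SLOW_MS, SLOW_MS, FAST_MS, FAST_MS])     # shards on all nodes
fast_only = query_latency_ms([FAST_MS, FAST_MS, FAST_MS, FAST_MS])  # shards on fast nodes

print(mixed, fast_only)  # 80 20
```

Under this model, a single shard on slow hardware is enough to cancel the benefit of the faster machines for that index.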



Cluster Allocation – Heterogeneous Hardware

• Wouldn’t this be better?

• Shard allocation settings allow us to “control” which nodes host which indices without ever specifying specific machines or IPs.


[Diagram: the four nodes with the shards rearranged so that A and B sit on the fast hardware]

Cluster Allocation – Heterogeneous Hardware


[Diagram: the same allocation, annotated with the settings that produce it]

Node Settings (slow nodes): node.hardware: slow

Node Settings (fast nodes): node.hardware: fast

Index Settings (A & B): index.routing.allocation.require.hardware: fast
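As a rough sketch of how the index setting above might be applied, the snippet below builds the request bodies for Elasticsearch's update-index-settings endpoint (`PUT /<index>/_settings`) without sending anything. The attribute name `hardware` and value `fast` come from the slide; the index names `a` and `b` and the helper functions are hypothetical.

```python
import json

REQUIRE_PREFIX = "index.routing.allocation.require."

def require_setting(attribute, value):
    """Build a settings body requiring a custom node attribute value."""
    return {REQUIRE_PREFIX + attribute: value}

def settings_requests(index_names, attribute, value):
    """Return (method, path, body) tuples, one per index; nothing is sent."""
    body = require_setting(attribute, value)
    return [("PUT", f"/{name}/_settings", body) for name in index_names]

# "a" and "b" are stand-ins for the heavily queried indices A and B.
for method, path, body in settings_requests(["a", "b"], "hardware", "fast"):
    print(method, path)
    print(json.dumps(body, indent=2))
```

Indices C and D get no such setting, so Elasticsearch remains free to place them on any node.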

Cluster Allocation – Heterogeneous Hardware

[Diagram: four nodes labeled Slow HW, Fast HW, Fast HW, Fast HW, hosting the shards of A, B, C and D]

• Is this ok? …Sure, why not?!
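The reason this layout is fine can be shown with a toy version of the "require" filter: an index may only live on nodes whose attributes match all of its required values, and an index with no requirement may live anywhere. This is a simplified model of the behavior, not Elasticsearch's allocator; the node names are hypothetical.

```python
REQUIRE_PREFIX = "index.routing.allocation.require."

def eligible_nodes(nodes, index_settings):
    """Return names of nodes that satisfy every 'require' setting."""
    required = {
        key[len(REQUIRE_PREFIX):]: value
        for key, value in index_settings.items()
        if key.startswith(REQUIRE_PREFIX)
    }
    return [
        name for name, attrs in nodes.items()
        if all(attrs.get(attr) == value for attr, value in required.items())
    ]

# One slow node and three fast nodes, as on the slide.
nodes = {
    "node1": {"hardware": "slow"},
    "node2": {"hardware": "fast"},
    "node3": {"hardware": "fast"},
    "node4": {"hardware": "fast"},
}
index_a = {"index.routing.allocation.require.hardware": "fast"}
index_c = {}  # no requirement set

print(eligible_nodes(nodes, index_a))  # ['node2', 'node3', 'node4']
print(eligible_nodes(nodes, index_c))  # ['node1', 'node2', 'node3', 'node4']
```

A and B are kept off the slow node, while C and D can still use spare capacity on the fast ones.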

Cluster Allocation – Archive Example

• We can use the same feature for the large data sets of a time-based feed. Say we keep an index for all news ever. People are generally searching the most recent 12 months, not the last 30 years.

[Diagram: a large grid of boxes labeled Slow HW, followed by eight boxes labeled Fast HW]
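One way the archive idea could be expressed, assuming the news feed is split into one index per month (an assumption, as are the `news-YYYY.MM` naming scheme and the example dates): indices inside the most recent 12 months require fast hardware, and older ones require slow hardware.

```python
from datetime import date

def months_between(newer, older):
    """Whole calendar months from `older` to `newer`."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)

def hardware_tier(index_month, today, hot_months=12):
    """'fast' for indices within the last `hot_months` months, else 'slow'."""
    return "fast" if months_between(today, index_month) < hot_months else "slow"

def settings_for(index_month, today):
    """Settings body pinning a monthly index to its hardware tier."""
    return {
        "index.routing.allocation.require.hardware": hardware_tier(index_month, today)
    }

today = date(2014, 6, 1)  # arbitrary example date
for month in (date(2014, 5, 1), date(2013, 7, 1), date(2013, 6, 1), date(1990, 1, 1)):
    print(f"news-{month:%Y.%m}", settings_for(month, today))
```

Rerunning this as months roll over would move an index from the fast tier to the slow tier once it falls out of the 12-month window.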
