Go Big Quick with Elasticsearch

Jason Scheller, Platform & Content Analytics, Eikon

Upload: jason-scheller

Post on 18-Jun-2015

DESCRIPTION

Presentation at Elasticsearch-NY meetup in April http://www.meetup.com/Elasticsearch-NY/events/176640472/

TRANSCRIPT

Page 1: Go Big Quick with Elasticsearch

Go Big Quick

Jason Scheller, Platform & Content Analytics, Eikon

Page 2: Go Big Quick with Elasticsearch

Pricing & Text Analytics Platform

• Mission - Ingest, enrich, store, analyze everything. Provide a single platform for search and analytics capabilities over any hosted content. Serve as a platform for future innovation.

• Content

• Twitter (~675 Tweets/sec, 15 days history)

• News (~40 articles/sec, 18 months history)

• Research (40 million docs, 3 million/year)

• Filings (29 million docs, 2.5 million/year)

• Trade data (500K RICs, 30K/sec, 10 years)

• Various metadata and derived content sets

Page 3: Go Big Quick with Elasticsearch

Pricing & Text Analytics Platform

Page 4: Go Big Quick with Elasticsearch

Pricing & Text Analytics Platform

Page 5: Go Big Quick with Elasticsearch

Infrastructure

IBM Streams: 30 servers

18 servers, 86 TB

Page 6: Go Big Quick with Elasticsearch

Where to start?

Data

Page 7: Go Big Quick with Elasticsearch

Max Shard

[Diagram: Data, an Index containing Shard 0, and JMeter]

• Disk space

• Request load

• RAM usage

Page 8: Go Big Quick with Elasticsearch

Maximum Shard Size

• This same experiment will also give you the ratio of data to index size, which is great for planning. Just make sure you’re using your real analyzer settings.

• The rest is just math!

• Don’t forget to account for:

• Memory required to facet & sort

• Replica shards

• Data compression

Max Total Index Size / Max Shard Size = # Nodes
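The arithmetic above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, parameter names, and input numbers are hypothetical (none come from the talk), it assumes one max-sized shard per node as the formula implies, and it does not model the memory needed to facet and sort.

```python
import math


def nodes_needed(raw_data_gb, index_to_data_ratio, replicas, max_shard_gb):
    """Apply: total index size / max shard size = number of nodes (rounded up)."""
    # Index size for primaries, from the measured data-to-index ratio.
    primary_index_gb = raw_data_gb * index_to_data_ratio
    # Each replica is a full extra copy of every primary shard.
    total_index_gb = primary_index_gb * (1 + replicas)
    return math.ceil(total_index_gb / max_shard_gb)


# Hypothetical inputs: 10 TB of raw data, an index 1.5x the raw size,
# 1 replica, and a measured max shard size of 400 GB.
print(nodes_needed(10_000, 1.5, 1, 400))  # prints 75
```

Rounding up matters: a result of 7.5 means 8 nodes, not 7.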

Page 9: Go Big Quick with Elasticsearch

SPREADSHEET

But do I always use Max Shards?

Page 10: Go Big Quick with Elasticsearch

ALLOCATION & HARDWARE

Page 11: Go Big Quick with Elasticsearch

Cluster Allocation

• Elasticsearch will figure out which node should host which shard. Let it! It’s better than you at figuring this out and moving shards around.

• Well, mostly…

• Let’s say you have indices A – D, 4 shards each, 0 replicas, 4 nodes. Elasticsearch might arrange your shards like this based on the size of each shard.

[Diagram: the 16 shards (A1-A4, B1-B4, C1-C4, D1-D4) distributed across the 4 nodes]

Page 12: Go Big Quick with Elasticsearch

Cluster Allocation

• But what about other considerations?

• Hot spotting

• Access frequency

• Connectivity for River-based ingestion

• Heterogeneous hardware

Page 13: Go Big Quick with Elasticsearch

Cluster Allocation – Heterogeneous Hardware

• Suppose you know that indices A and B get queried 1000s of times per second, but C and D are only hit about once a second. Maybe you bought some better hardware to host A and B and don’t want to waste those machines on C and D.

• Is this a good allocation?

[Diagram: four nodes labeled Slow HW, Slow HW, Fast HW, Fast HW, with the shards of indices A-D spread across all four]

Page 14: Go Big Quick with Elasticsearch

Cluster Allocation – Heterogeneous Hardware

• Suppose you know that indices A and B get queried 1000s of times per second, but C and D are only hit about once a second. Maybe you bought some better hardware to host A and B and don’t want to waste those machines on C and D.

• Is this a good allocation?

• Not really. The slower machines will slow all queries to A & B. And I’m not getting my money’s worth from that better hardware!

Page 15: Go Big Quick with Elasticsearch

Cluster Allocation – Heterogeneous Hardware

• Wouldn’t this be better?

• Shard allocation settings allow us to “control” which nodes host which indices without ever naming specific machines or IPs.

[Diagram: the same four nodes (Slow HW, Slow HW, Fast HW, Fast HW), with the shards of indices A-D rearranged]

Page 16: Go Big Quick with Elasticsearch

Cluster Allocation – Heterogeneous Hardware

[Diagram: the same four nodes (Slow HW, Slow HW, Fast HW, Fast HW), annotated with the settings below]

Node Settings (slow nodes): node.hardware: slow

Node Settings (fast nodes): node.hardware: fast

Index Settings (A & B): index.routing.allocation.require.hardware: fast
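To make the effect of these settings concrete, here is a small, simplified model of the filter. It is not Elasticsearch’s allocator: the function and node names are hypothetical, and it only does exact matching of each `require` value against the node attribute of the same name.

```python
REQUIRE_PREFIX = "index.routing.allocation.require."


def eligible_nodes(index_settings, nodes):
    """Return names of nodes whose attributes satisfy every 'require' setting."""
    # Collect {attribute: required value} from the index settings.
    required = {
        key[len(REQUIRE_PREFIX):]: value
        for key, value in index_settings.items()
        if key.startswith(REQUIRE_PREFIX)
    }
    return [
        name
        for name, attrs in nodes.items()
        if all(attrs.get("node." + attr) == value for attr, value in required.items())
    ]


# Node names are hypothetical; the attribute values mirror the slide.
nodes = {
    "node1": {"node.hardware": "slow"},
    "node2": {"node.hardware": "slow"},
    "node3": {"node.hardware": "fast"},
    "node4": {"node.hardware": "fast"},
}
settings_a_b = {"index.routing.allocation.require.hardware": "fast"}
settings_c_d = {}  # no allocation filter set on C and D

print(eligible_nodes(settings_a_b, nodes))  # ['node3', 'node4']
print(eligible_nodes(settings_c_d, nodes))  # ['node1', 'node2', 'node3', 'node4']
```

In this model, indices A and B are restricted to the fast nodes, while C and D, which carry no filter, remain eligible for any node.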

Page 17: Go Big Quick with Elasticsearch

Cluster Allocation – Heterogeneous Hardware

[Diagram: four nodes labeled Slow HW, Fast HW, Fast HW, Fast HW, with the shards of indices A-D spread across them]

• Is this ok? …Sure, why not?!

Page 18: Go Big Quick with Elasticsearch

Cluster Allocation – Archive Example

• We can use the same feature for large data sets of a time-based feed. Say we keep an index for all news ever. People are generally searching the most recent 12 months, not the last 30 years.
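One way this could look, as a sketch only: it assumes the feed is split into one index per month (the slide does not say how the archive is partitioned), and the index names, function names, and the choice to pin older months to slow nodes are all hypothetical. It only builds the settings body; sending it to the cluster is left out.

```python
from datetime import date


def months_between(older, newer):
    """Whole calendar months from 'older' to 'newer'."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def allocation_for(index_month, today, hot_months=12):
    """Pick the hardware tier for a monthly index based on its age."""
    tier = "fast" if months_between(index_month, today) < hot_months else "slow"
    return {"index.routing.allocation.require.hardware": tier}


today = date(2014, 4, 1)
for month in (date(2014, 3, 1), date(2013, 5, 1), date(2013, 4, 1), date(1990, 1, 1)):
    # Hypothetical naming scheme: news-YYYY-MM
    print("news-%04d-%02d" % (month.year, month.month), allocation_for(month, today))
```

With these inputs, the 2014-03 and 2013-05 indices are assigned to fast hardware and the 2013-04 and 1990-01 indices to slow hardware.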

[Diagram: a large grid of nodes labeled Slow HW and a smaller group of nodes labeled Fast HW]

Page 19: Go Big Quick with Elasticsearch