Scalable Data Models with Elasticsearch

Elasticsearch Meetup | Amsterdam | April 7, 2016
Maarten Roosendaal & Anne Veling

Introduction
• Anne Veling
– Elasticsearch consultancy and custom training
– Performance and Stability Troubleshooting
– Software Architect, Team Lead

• Hierarchical data model, multiple levels
• High volume
– searches
– data changes
• Complex query requirements
– Both Product and Offer fields in query
– Facet on both levels

bol.com challenge

Products and Offers

faster indexing

faster searching

Test Data Creation
• Node.js script creating random data
– Product
  • Title: two random nouns from noun list
  • Category: pick one out of 26 nouns
  • Half have no offer, half between 1-4
– Offer
  • Random price between 1-20
  • Seller: pick one out of 10k
• Stream in memory, flush out to disk in 3 flavors
– Each flavor keeping its own bulk size of 100k
– For 1M, 10M and 100M products
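The generation rules above can be sketched roughly as follows. The original script was written in Node.js and is not shown in the slides; this is a Python approximation, and the noun list, category list, seed, and function names are hypothetical stand-ins. Fields the slide does not describe (familyId, stock, deliveryCode) are left out.

```python
import random

# Hypothetical stand-ins: the talk used a real noun list and 26 category nouns.
NOUNS = ["lunchroom", "representative", "crime", "garden", "engine", "window"]
CATEGORIES = NOUNS[:3]
SELLER_COUNT = 10_000


def make_product(index, rng):
    """Build one product and its offers following the rules on the slide."""
    product = {
        "id": f"product{index}",
        # Title: two random nouns from the noun list.
        "title": " ".join(rng.choice(NOUNS) for _ in range(2)),
        "category": rng.choice(CATEGORIES),
    }
    # Half of the products get no offer, the other half get 1 to 4 offers.
    offer_count = 0 if rng.random() < 0.5 else rng.randint(1, 4)
    offers = [
        {
            "seller": f"seller{rng.randint(1, SELLER_COUNT)}",
            "price": rng.randint(1, 20),
        }
        for _ in range(offer_count)
    ]
    return product, offers


def generate(count, seed=42):
    """Yield (product, offers) pairs lazily so large sets never sit in memory."""
    rng = random.Random(seed)
    for index in range(count):
        yield make_product(index, rng)
```

Because `generate` is a generator, each pair can be written out to the three flavors as it is produced, which mirrors the "stream in memory, flush out to disk" approach.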

Document
{
  "seller": "seller1203",
  "price": 7,
  "stock": 2,
  "deliveryCode": 1,
  "product": {
    "id": "product95826",
    "familyId": "family56744",
    "title": "lunchroom representative",
    "category": "crime"
  }
}

Nested
{
  "_id": "product95826",
  "familyId": "family56744",
  "title": "lunchroom representative",
  "category": "crime",
  "offers": [
    {
      "seller": "seller1203",
      "price": 7,
      "stock": 2,
      "deliveryCode": 1
    }
  ]
}

Parent/Child
{
  "_id": "product95826",
  "familyId": "family56744",
  "title": "lunchroom representative",
  "category": "crime"
}

{
  "_parent": "product95826",
  "seller": "seller1203",
  "price": 7,
  "stock": 2,
  "deliveryCode": 1
}
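To make the difference between the three flavors concrete, here is a minimal Python sketch that turns one product and its offers into bulk lines for each model. It assumes the bulk metadata conventions of the Elasticsearch 2.x era in which the talk was given (`_parent` in the action line); the function names and the child type name `offer` are hypothetical.

```python
import json


def doc_flavor(product, offers):
    """One document per offer with the product embedded; a product
    without offers still yields a single product-only document."""
    if not offers:
        return [({}, {"product": product})]
    return [({}, {**offer, "product": product}) for offer in offers]


def nested_flavor(product, offers):
    """One product document carrying all of its offers in an array."""
    fields = {k: v for k, v in product.items() if k != "id"}
    return [({"_id": product["id"]}, {**fields, "offers": offers})]


def parent_child_flavor(product, offers):
    """One parent document per product plus one child document per offer."""
    fields = {k: v for k, v in product.items() if k != "id"}
    pairs = [({"_type": "product", "_id": product["id"]}, fields)]
    # "offer" as the child type name is an assumption.
    pairs += [({"_type": "offer", "_parent": product["id"]}, o) for o in offers]
    return pairs


def to_bulk(pairs):
    """Serialize (metadata, source) pairs as newline-delimited bulk lines."""
    lines = []
    for meta, source in pairs:
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

For a product with two offers this yields two documents in the Document flavor, one in the Nested flavor, and three in the Parent/Child flavor.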

• Zipped data files
– 1M: 86 MB
– 10M: 860 MB
– 100M: 8.6 GB

Getting it there

Indexing?

Indexing
• 1M product set, local naive
– 80s Document
– 41s Nested
– 64s Parent/Child
• ES index bottleneck:
– Your source system and latency

It can slurp it up faster than you can serve it

Let’s take a break

Use Cases (A, B, C)

• Product Search
– A: Word in Title
– B: Word in Title, ∃ DeliveryC = 0
– C: Word in Title, ∃ Price < P
• Order By
– A: Relevance
– B: Relevance
– C: (Lowest) Price
• Display for top N products
– A: Product Fields, Cheapest Offer fields
– B: Product Fields, Correct Cheapest Offer fields
– C: Product Fields, Cheapest Offer fields
• Aggregate On
– A: Category; ∀ Offer: SellerId, Price, DeliveryCode
– B: Category; ∀ Correct Offers: SellerId, Price, DeliveryCode
– C: Category; ∀ Correct Offers: SellerId, Price, DeliveryCode

Legend:
• Product
• Offer
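As an illustration only, Use Case B against the Nested flavor might be expressed with a request body like the one built below. This is a sketch, not the query used in the talk: it assumes `offers` is mapped as a nested field, and the function name and aggregation names are hypothetical. Relevance ordering is the default, so no explicit sort is given.

```python
def use_case_b_nested(word, size=10):
    """Search body: word in title, at least one offer with deliveryCode 0."""
    correct_offer = {"term": {"offers.deliveryCode": 0}}
    return {
        "size": size,
        "query": {
            "bool": {
                # Product-level condition: the word occurs in the title.
                "must": [{"term": {"title": word}}],
                # Offer-level condition: some nested offer is "correct".
                "filter": [
                    {"nested": {"path": "offers", "query": correct_offer}}
                ],
            }
        },
        "aggs": {
            # Product facet.
            "category": {"terms": {"field": "category"}},
            # Offer facets, restricted to the correct offers only.
            "offers": {
                "nested": {"path": "offers"},
                "aggs": {
                    "correct": {
                        "filter": correct_offer,
                        "aggs": {
                            "seller": {"terms": {"field": "offers.seller"}},
                            "deliveryCode": {
                                "terms": {"field": "offers.deliveryCode"}
                            },
                        },
                    }
                },
            },
        },
    }
```

The offer condition appears twice on purpose: once to select products and once inside the nested aggregation, so that the offer facets count only the correct offers rather than every offer of a matching product.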

Use Cases
D: query B, roll up by family
• Families (with products with offers)
– with product.title:lunchroom
– filter by product.offer.deliveryCode:tomorrow

Searching for a lunchroom

How hard can it be?

Let’s search

POST /boltest1m_doc/_search -> 3046
{
  "query": {
    "term": { "product.title": { "value": "lunchroom" } }
  }
}

POST /boltest1m_nested/_search -> 2026
{
  "query": {
    "term": { "title": { "value": "lunchroom" } }
  }
}

POST /boltest1m_parentchild/_search -> 2022
{
  "query": {
    "has_parent": {
      "parent_type": "product",
      "query": {
        "term": { "title": { "value": "lunchroom" } }
      }
    }
  }
}

Elasticsearch docs (and Lucene docs)

Product with   Doc   Nested   Parent/Child
no offer       1     1 (1)    1
1 offer        1     1 (2)    2
2 offers       2     1 (3)    3
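The counts in the table follow a simple rule per model, which the small Python sketch below reproduces. The function name is hypothetical; each value is a pair of (Elasticsearch docs, Lucene docs) for a single product.

```python
def document_counts(offer_count):
    """Return (Elasticsearch docs, Lucene docs) per model for one product."""
    return {
        # One document per offer, or one product-only document without offers.
        "doc": (max(1, offer_count),) * 2,
        # One visible document; each nested offer is an extra Lucene document.
        "nested": (1, 1 + offer_count),
        # One parent document plus one child document per offer.
        "parent_child": (1 + offer_count,) * 2,
    }
```

This is why hit counts for the same search differ between the three indices: each model returns a different kind of row.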

Real Queries
• Add Details, Sorting
• Product Facets
– Category
• Offer Facets
– Seller ID
– Price Buckets
– Delivery Code

Compare the numbers…
Explain the differences...

A: Doc

A: Nested

A: Parent/Child

Query Tips
• Use aggregations
– Cardinality
– top_hits ♥ (with top_score)
• Smart Grouping & Field Collapsing
• Slooooow 😢
– inner_hits
• Don’t forget post-filtering or result page lookup
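A minimal sketch of how the cardinality and top_hits aggregations could be combined on the Document flavor to group offers by product and show the cheapest offer for each. The field names come from the Document example; the function name and aggregation names are hypothetical, and ordering the buckets by best score (the "top_score" part of the tip) is left out here.

```python
def cheapest_offer_per_product(word, top_n=10):
    """Search body that collapses offer documents into products."""
    return {
        # Hits are not needed; the buckets are the result rows.
        "size": 0,
        "query": {"term": {"product.title": word}},
        "aggs": {
            # Approximate number of distinct products that matched.
            "product_count": {"cardinality": {"field": "product.id"}},
            # One bucket per product, each holding its cheapest offer.
            "products": {
                "terms": {"field": "product.id", "size": top_n},
                "aggs": {
                    "cheapest": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"price": {"order": "asc"}}],
                        }
                    }
                },
            },
        },
    }
```

Since the Document flavor stores one document per offer, this kind of grouping is needed for every product-level result page, which ties in with the later conclusion that Doc was slower than expected because of top_hits.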

Ice Cream Bounty

for making top_hits aggregation fast

Testing

Results

[Chart: 1m tun 30102015 32 GB new queries. Bars for doc, nested and parentchild per use case a, b, c, d; y-axis from 0 to 200.]

[Chart: 10m tun 30102015 32 GB new queries. Bars for doc, nested and parentchild per use case a, b, c, d; y-axis from 0 to 3500.]

Conclusions
• Parent/Child has limitations
– Combining cross-level queries with aggregations in one go
• Doc not as fast as we’d expected
– Because we needed top_hits aggregation
• Elasticsearch scales predictably

Conclusions
• For us, nested was the best solution
• What is yours?
• What are you searching for?
– What are the rows?
– What are the facets about?

Lessons Learned
• Testing the scalability of your data model
– Fast iterations early on
– Valuable insight in indexing and search requirements
• Data Modeling is hard
– Do it early
– Make it fun

Tech Lessons Learned
• Don’t forget to tune the ES cluster
– Configure memory ;)
• If the last line of a bulk file has no \n, it gets ignored!
– Count the differences
• 100k bulk files with .000 suffixes ought to be enough for everyone, right?
• Do not underestimate Sneakernet
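The trailing-newline lesson can be guarded against with a small check before sending a bulk file. This is a sketch with hypothetical helper names; it assumes an index-only bulk body, where every document takes exactly two lines (action and source).

```python
def ensure_trailing_newline(body: str) -> str:
    """Append the final newline the bulk API expects, if it is missing."""
    return body if body.endswith("\n") else body + "\n"


def count_bulk_documents(body: str) -> int:
    """Count documents in an index-only bulk body (two lines per document)."""
    lines = [line for line in body.split("\n") if line.strip()]
    if len(lines) % 2:
        raise ValueError("odd number of lines: an action or source line is missing")
    return len(lines) // 2
```

Comparing the value from `count_bulk_documents` with the document count reported by the index after loading is one way to "count the differences".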

Thank You

@anneveling