Scalable Data Models with Elasticsearch
Elasticsearch Meetup | Amsterdam | April 7, 2016
Maarten Roosendaal & Anne Veling
Introduction
• Anne Veling
– Elasticsearch consultancy and custom training
– Performance and Stability Troubleshooting
– Software Architect, Team Lead
• Hierarchical data model, multiple levels
• High volume
– searches
– data changes
• Complex query requirements
– Both Product and Offer fields in query
– Facet on both levels
bol.com challenge
Products and Offers
faster indexing
faster searching
Test Data Creation
• Node.js script creating random data
– Product
  • Title: two random nouns from a noun list
  • Category: pick one out of 26 nouns
  • Half have no offers, half have between 1 and 4
– Offer
  • Random price between 1 and 20
  • Seller: pick one out of 10k
• Stream in memory, flush out to disk in 3 flavors
– Each flavor keeping its own bulk size of 100k
– For 1M, 10M and 100M products
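The generation rules above can be sketched roughly as follows. This is a Python approximation, not the Node.js script used for the talk: the short noun list, the `generate_products` name and the fixed seed are illustrative assumptions, and fields such as stock, deliveryCode and familyId are left out because the slides do not say how they were drawn.

```python
import random

# Hypothetical stand-in for the noun list; the real list is not shown in the talk.
NOUNS = ["lunchroom", "representative", "crime", "garden", "bottle", "window"]
CATEGORIES = NOUNS      # the talk picks from 26 nouns; shortened here
SELLER_COUNT = 10_000   # "pick one out of 10k"


def random_offer(rng):
    """One offer: a seller out of 10k and a price between 1 and 20."""
    return {
        "seller": f"seller{rng.randrange(SELLER_COUNT)}",
        "price": rng.randint(1, 20),
    }


def random_product(rng, number):
    """One product: half have no offers, the other half have 1 to 4."""
    offer_count = 0 if rng.random() < 0.5 else rng.randint(1, 4)
    return {
        "id": f"product{number}",
        "title": " ".join(rng.choice(NOUNS) for _ in range(2)),
        "category": rng.choice(CATEGORIES),
        "offers": [random_offer(rng) for _ in range(offer_count)],
    }


def generate_products(count, seed=42):
    """Yield products one by one so a large set never sits fully in memory."""
    rng = random.Random(seed)
    for number in range(count):
        yield random_product(rng, number)


products = list(generate_products(1000))
```

Because the generator is seeded, the same product set can be replayed into each of the three flavors, which keeps the comparison between data models fair.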
Document
{
  "seller": "seller1203",
  "price": 7,
  "stock": 2,
  "deliveryCode": 1,
  "product": {
    "id": "product95826",
    "familyId": "family56744",
    "title": "lunchroom representative",
    "category": "crime"
  }
}
Nested
{
  "_id": "product95826",
  "familyId": "family56744",
  "title": "lunchroom representative",
  "category": "crime",
  "offers": [
    {
      "seller": "seller1203",
      "price": 7,
      "stock": 2,
      "deliveryCode": 1
    }
  ]
}
Parent/Child
{
  "_id": "product95826",
  "familyId": "family56744",
  "title": "lunchroom representative",
  "category": "crime"
}
{
  "_parent": "product95826",
  "seller": "seller1203",
  "price": 7,
  "stock": 2,
  "deliveryCode": 1
}
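To make the relationship between the three flavors concrete, here is a minimal Python sketch that reshapes one product (with its offers embedded) into each layout shown on the slides. The function names are hypothetical. The choice to emit a single offer-less document for a product without offers in the Document flavor follows the document-count table later in the talk.

```python
def product_core(product, drop=("offers",)):
    """Product fields without the keys listed in `drop`."""
    return {k: v for k, v in product.items() if k not in drop}


def to_document_flavor(product):
    """Denormalised: one document per offer, product fields repeated in each."""
    core = product_core(product)
    offers = product["offers"] or [{}]  # no offers: still one product-only document
    return [{**offer, "product": core} for offer in offers]


def to_nested_flavor(product):
    """One document per product with the offers embedded as an array."""
    core = product_core(product, drop=("offers", "id"))
    return [{"_id": product["id"], **core, "offers": product["offers"]}]


def to_parent_child_flavor(product):
    """One parent document plus one child document per offer."""
    core = product_core(product, drop=("offers", "id"))
    parent = {"_id": product["id"], **core}
    children = [{"_parent": product["id"], **offer} for offer in product["offers"]]
    return [parent] + children


example = {
    "id": "product95826",
    "familyId": "family56744",
    "title": "lunchroom representative",
    "category": "crime",
    "offers": [{"seller": "seller1203", "price": 7, "stock": 2, "deliveryCode": 1}],
}
doc_docs = to_document_flavor(example)
nested_docs = to_nested_flavor(example)
parent_child_docs = to_parent_child_flavor(example)
```

In a real bulk file the `_id` and `_parent` values would travel in the bulk action line rather than in the document body; they are kept inline here only to mirror the slides.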
• Zipped data files
– 1M: 86 MB
– 10M: 860 MB
– 100M: 8.6 GB
Getting it there
Indexing?
Indexing
• 1M product set, local, naive
– 80 s Document
– 41 s Nested
– 64 s Parent/Child
• ES index bottleneck:
– Your source system and latency
it can slurp it up faster than you can serve it
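As a quick worked calculation, the indexing times quoted for the 1M product set translate into the throughput below. The numbers are products per second, not Elasticsearch documents per second, since each flavor produces a different number of documents per product.

```python
PRODUCTS = 1_000_000

# Indexing times in seconds for the 1M product set, as quoted in the talk.
seconds = {"Document": 80, "Nested": 41, "Parent/Child": 64}

throughput = {model: PRODUCTS / s for model, s in seconds.items()}

for model, rate in throughput.items():
    print(f"{model}: {rate:,.0f} products/s")
# Document: 12,500 products/s
# Nested: 24,390 products/s
# Parent/Child: 15,625 products/s
```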
Let’s take a break
Use Cases
Product Search
– Use Case A: Word in Title
– Use Case B: Word in Title, ∃ DeliveryC = 0
– Use Case C: Word in Title, ∃ Price < P
Order By
– Use Case A: Relevance
– Use Case B: Relevance
– Use Case C: (Lowest) Price
Display for top N products
– Use Case A: Product Fields, Cheapest Offer fields
– Use Case B: Product Fields, Correct Cheapest Offer fields
– Use Case C: Product Fields, Cheapest Offer fields
Aggregate On
– Use Case A: Category; ∀ Offer SellerId; ∀ Offer Price; ∀ Offer DeliveryCode
– Use Case B: Category; ∀ Correct Offers SellerId; ∀ Correct Offers Price; ∀ Correct Offers DeliveryCode
– Use Case C: Category; ∀ Correct Offers SellerId; ∀ Correct Offers Price; ∀ Correct Offers DeliveryCode
• Product
• Offer
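As an illustration of Use Case B (a word in the title, and at least one offer with delivery code 0), the request body could be built along these lines. This is a sketch, not the query from the talk: it assumes the `offers` field is mapped as `nested` and that the child type in the parent/child index is called `offer`, neither of which the slides confirm.

```python
import json


def use_case_b_nested(word, delivery_code=0):
    """Nested model: match products whose title has `word` and that have
    at least one nested offer with the given delivery code."""
    return {
        "query": {
            "bool": {
                "must": [{"term": {"title": {"value": word}}}],
                "filter": [
                    {
                        "nested": {
                            "path": "offers",
                            "query": {"term": {"offers.deliveryCode": delivery_code}},
                        }
                    }
                ],
            }
        }
    }


def use_case_b_parent_child(word, delivery_code=0, child_type="offer"):
    """Parent/child model: same condition expressed with has_child.
    `child_type` is an assumed type name."""
    return {
        "query": {
            "bool": {
                "must": [{"term": {"title": {"value": word}}}],
                "filter": [
                    {
                        "has_child": {
                            "type": child_type,
                            "query": {"term": {"deliveryCode": delivery_code}},
                        }
                    }
                ],
            }
        }
    }


query_b = use_case_b_nested("lunchroom")
print(json.dumps(query_b, indent=2))
```

Use Case C follows the same shape with the `term` clause on the delivery code swapped for a `range` clause on the offer price.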
Use Cases
D: query B, roll up by family
• Families (with products with offers)
– with product.title:lunchroom
– filter by product.offer.deliveryCode:tomorrow
Searching for a lunchroom
How hard can it be?
Let’s search
POST /boltest1m_doc/_search -> 3046
{
  "query": {
    "term": { "product.title": { "value": "lunchroom" } }
  }
}
POST /boltest1m_nested/_search -> 2026
{
  "query": {
    "term": { "title": { "value": "lunchroom" } }
  }
}
POST /boltest1m_parentchild/_search -> 2022
{
  "query": {
    "has_parent": {
      "parent_type": "product",
      "query": {
        "term": { "title": { "value": "lunchroom" } }
      }
    }
  }
}
Elasticsearch docs (and Lucene docs)
Product with    Doc    Nested    Parent/Child
no offer        1      1 (1)     1
1 offer         1      1 (2)     2
2 offers        2      1 (3)     3
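The counts in this table follow a simple rule per data model, which a small Python sketch can reproduce. The function name is hypothetical; the rule for the Document and Parent/Child flavors relies on the fact that a non-nested Elasticsearch document is stored as exactly one Lucene document.

```python
def doc_counts(offer_count):
    """Return (Elasticsearch docs, Lucene docs) per model for one product."""
    doc = max(1, offer_count)          # one doc per offer, or one product-only doc
    parent_child = 1 + offer_count     # one parent plus one child per offer
    return {
        "doc": (doc, doc),
        "nested": (1, 1 + offer_count),  # one visible doc, hidden Lucene doc per offer
        "parent_child": (parent_child, parent_child),
    }


for offers in (0, 1, 2):
    print(offers, doc_counts(offers))
```

This is also why the three hit counts for the same `lunchroom` search differ: each index counts a different unit (offers with product data, products, or offer children).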
Real Queries
• Add Details, Sorting
• Product Facets
– Category
• Offer Facets
– Seller ID
– Price Buckets
– Delivery Code
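For the nested model, the product and offer facets listed above could be requested with an aggregation body like the following sketch. The price bucket boundaries and aggregation names are made up for illustration, and the field names assume the nested mapping from the earlier slide with `offers` as the nested path.

```python
def facet_aggregations():
    """Aggregations for one product-level facet and three offer-level facets."""
    return {
        # Product facet: counted on the top-level product documents.
        "category": {"terms": {"field": "category"}},
        # Offer facets: step into the nested offers before aggregating.
        "offers": {
            "nested": {"path": "offers"},
            "aggs": {
                "seller": {"terms": {"field": "offers.seller"}},
                "price_buckets": {
                    "range": {
                        "field": "offers.price",
                        # Illustrative boundaries, not taken from the talk.
                        "ranges": [{"to": 5}, {"from": 5, "to": 10}, {"from": 10}],
                    }
                },
                "delivery_code": {"terms": {"field": "offers.deliveryCode"}},
            },
        },
    }


search_body = {
    "query": {"term": {"title": {"value": "lunchroom"}}},
    "aggs": facet_aggregations(),
}
```

To facet only on the offers that satisfy the query condition (the "Correct Offers" of Use Cases B and C), a `filter` aggregation would typically be placed inside the `nested` aggregation, wrapping the three offer facets.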
Compare the numbers…
Explain the differences…
A: Doc
A: Nested
A: Parent/Child
Query Tips
• Use aggregations
– Cardinality
– top_hits ♥ (with top_score)
  • Smart Grouping & Field Collapsing
  • Slooooow 😢
– inner_hits
• Don’t forget post-filtering or result page lookup
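A rough sketch of the grouping tip for the Document model: offer documents are collapsed per product with a `terms` aggregation, buckets are ordered by their best score, and `top_hits` returns the cheapest offer in each bucket. This is not the query from the talk. The function and aggregation names are hypothetical, and the script syntax for `_score` differs between Elasticsearch versions (older releases use `inline` where newer ones use `source`).

```python
def cheapest_offer_per_product(word, page_size=10):
    """Field collapsing on product.id using terms + top_hits."""
    return {
        "size": 0,  # the hits come from the aggregation, not the hit list
        "query": {"term": {"product.title": {"value": word}}},
        "aggs": {
            # Approximate number of distinct products, for paging totals.
            "product_count": {"cardinality": {"field": "product.id"}},
            "products": {
                "terms": {
                    "field": "product.id",
                    "size": page_size,
                    "order": {"top_score": "desc"},  # most relevant products first
                },
                "aggs": {
                    # Best score in the bucket, used only for ordering.
                    "top_score": {"max": {"script": {"source": "_score"}}},
                    # Cheapest offer document within each product bucket.
                    "cheapest_offer": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"price": {"order": "asc"}}],
                        }
                    },
                },
            },
        },
    }


body = cheapest_offer_per_product("lunchroom")
```

Every bucket carries its own scoring and sorting work, which is consistent with the "Slooooow" remark on the slide.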
Ice Cream Bounty
for making top_hits aggregation fast
Testing
Results
[Chart: 1m tun 30102015 32 GB new queries. Use cases a, b, c, d on the x-axis; series doc, nested, parentchild; y-axis from 0 to 200]
[Chart: 10m tun 30102015 32 GB new queries. Use cases a, b, c, d on the x-axis; series doc, nested, parentchild; y-axis from 0 to 3500]
Conclusions
• Parent/Child has limitations
– Combining cross-level queries with aggregations in one go
• Doc not as fast as we’d expected
– Because we needed top_hits aggregation
• Elasticsearch scales predictably
Conclusions
• For us, nested was the best solution
• What is yours?
• What are you searching for?
– What are the rows?
– What are the facets about?
Lessons Learned
• Testing the scalability of your data model
– Fast iterations early on
– Valuable insight into indexing and search requirements
• Data Modeling is hard
– Do it early
– Make it fun
Tech Lessons Learned
• Don’t forget to tune the ES cluster
– Configure memory ;)
• If the last line of a bulk file has no \n, it gets ignored!
– count the differences
• 100k bulk files with .000 suffixes ought to be enough for everyone, right?
• Do not underestimate Sneakernet
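The trailing-newline pitfall above can be guarded against when the bulk files are written. The sketch below is a Python illustration rather than the tooling from the talk: the helper names and the file name are hypothetical, and the action line omits `_type`, which older Elasticsearch versions require.

```python
import json
import os
import tempfile


def write_bulk_file(path, index, docs):
    """Write action/source line pairs; every line, including the last, ends in \\n."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps({"index": {"_index": index}}) + "\n")
            f.write(json.dumps(doc) + "\n")


def ends_with_newline(path):
    """True if the file is non-empty and its final byte is a newline."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b"\n"


def count_lines(path):
    """Number of newline-terminated lines, to compare against indexed doc counts."""
    with open(path, "rb") as f:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(1 << 20), b""))


path = os.path.join(tempfile.mkdtemp(), "bulk.000")
write_bulk_file(path, "boltest1m_nested", [{"title": "lunchroom representative"}])
line_count = count_lines(path)
```

Half of `line_count` is the number of documents the file should produce, so comparing it with the index's document count after loading is one way to "count the differences".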
Thank You
@anneveling