elasticsearch from the trenches
TRANSCRIPT
![Page 2: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/2.jpg)
about me
solution architect at slalom
enjoy building search apps
7+ years Lucene
2+ years Hibernate Search
~2 years Elasticsearch
![Page 3: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/3.jpg)
agenda
the ask
initial approach
problems
next steps
lessons learned
improvements
questions
![Page 4: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/4.jpg)
the ask
search 6 billion docs in under 1.5 sec
index 2 million new docs / day
export billions of docs to CSV files
index and search docs in realtime
use search throughout the application: free text search, faceted navigation, suggestions, dashboards
![Page 5: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/5.jpg)
free text search
![Page 6: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/6.jpg)
faceted navigation
drill down
![Page 7: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/7.jpg)
suggestions
![Page 8: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/8.jpg)
dashboards
![Page 9: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/9.jpg)
initial approach

hardware
used "large" servers
servers had lots of CPUs & RAM
non-RAIDed spinning disks

cluster
5 dedicated nodes
all nodes store data
all nodes are master
all nodes sort & aggregate
![Page 10: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/10.jpg)
initial approach

shards
used the default shard count
5 primary + 1 replica
unlimited primary shards / node

indices
data was chronological
used the time-based index strategy
weekly indices for transaction logs
daily indices for audit logs
![Page 11: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/11.jpg)
initial approach

memory
dedicated 31 GB to the jvm heap
used remaining memory for file system cache
turned off linux process swapping
maxed out linux file descriptors
used G1 Garbage Collector

index mappings
indexed all fields
stored big documents with 60+ fields
nested documents
parent-child relationships
![Page 12: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/12.jpg)
initial approach

searches
searched all indices
used query_string searches
searched all fields
sorted & aggregated on any field
range queries
parent-child queries

GET /index-*/_search
"query_string": {
  "query": "+(eggplant | potato)",
  "default_field": "_all",
  "default_operator": "and"
}
![Page 13: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/13.jpg)
problems
early signs

OutOfMemoryError
field data exceeded jvm heap
shard count was in the thousands
garbage collector could not free memory

CircuitBreakerException
field data exceeded jvm heap
search results exceeded jvm heap

slow searches (latency increased from seconds to minutes)
nodes became unresponsive
frequent GC pauses
![Page 14: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/14.jpg)
cluster down
index corruption
data loss
nodes failed to restart
![Page 15: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/15.jpg)
next steps
identify and fix problems...

shard capacity
understand data & searches
size based on actual usage

field data
monitor
identify the producers
reduce usage

search
identify bottlenecks
optimize

cluster
find failure points
make topology changes
make hardware changes
![Page 16: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/16.jpg)
shard capacity

learned...
1 shard can handle a lot of data
actually it held ~5x more data
didn't need 5 shards per index
didn't need weekly/daily indices

shard is the unit of scale
how much data can a single shard hold?
find the single shard breaking point
1. loaded a single shard with data
2. ran typical searches
3. recorded search response time
4. repeated until response time became unacceptable
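
The four steps above can be sketched as a small loop. This is only an illustration of the procedure, not the tooling used in the talk: `load_batch` and `run_typical_searches` are hypothetical callables that you would wire to your own indexing and search code, and `max_latency_sec` is whatever response time you consider unacceptable.

```python
from typing import Callable, List, Tuple


def find_shard_breaking_point(
    load_batch: Callable[[], int],
    run_typical_searches: Callable[[], float],
    max_latency_sec: float,
    max_rounds: int = 1000,
) -> List[Tuple[int, float]]:
    """Repeat load -> search -> record until latency becomes unacceptable.

    load_batch() is assumed to index one more batch into the single test
    shard and return the number of docs added. run_typical_searches() is
    assumed to return the observed response time in seconds.
    Returns the recorded (total_docs, latency) pairs.
    """
    history: List[Tuple[int, float]] = []
    total_docs = 0
    for _ in range(max_rounds):
        total_docs += load_batch()              # 1. load the single shard
        latency = run_typical_searches()        # 2. run typical searches
        history.append((total_docs, latency))   # 3. record response time
        if latency > max_latency_sec:           # 4. stop once unacceptable
            break
    return history
```

The last entry of the returned list marks the approximate breaking point. With the 1.5 sec search target from "the ask", one would pass `max_latency_sec=1.5`.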
![Page 17: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/17.jpg)
field data
which fields and indices are using a lot of field data?
use the stats API to find out

culprits...
fields used for sorting & aggregation
high cardinality fields
id-cache for parent-child relationships
field data is loaded the first time a field is accessed
field data is maintained per-index
field data is not GC'd

# Node Stats
curl -XGET 'http://localhost:9200/_nodes/stats/indices/fielddata?human'
# Indices Stats
curl -XGET 'http://localhost:9200/_stats/fielddata/?human'
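
Once a node stats response is in hand, the heaviest fields can be ranked with a few lines of Python. The sketch below assumes the response was requested with a per-field breakdown (for example by adding `fields=*` to the node stats call) and that it follows the `nodes -> <node id> -> indices -> fielddata -> fields -> <field> -> memory_size_in_bytes` layout. Treat that layout as an assumption and check it against the output of your own Elasticsearch version. The function name `top_fielddata_fields` is made up for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_fielddata_fields(node_stats: Dict, limit: int = 5) -> List[Tuple[str, int]]:
    """Sum field data bytes per field across all nodes, largest first."""
    totals: Counter = Counter()
    for node in node_stats.get("nodes", {}).values():
        # Assumed layout: indices.fielddata.fields.<name>.memory_size_in_bytes
        fields = node.get("indices", {}).get("fielddata", {}).get("fields", {})
        for name, stats in fields.items():
            totals[name] += stats.get("memory_size_in_bytes", 0)
    return totals.most_common(limit)


# Hand-written sample shaped like the assumed response (not real cluster output).
sample = {
    "nodes": {
        "node-1": {"indices": {"fielddata": {"fields": {
            "user_id": {"memory_size_in_bytes": 200},
            "status": {"memory_size_in_bytes": 100},
        }}}},
        "node-2": {"indices": {"fielddata": {"fields": {
            "user_id": {"memory_size_in_bytes": 50},
        }}}},
    }
}
print(top_fielddata_fields(sample))
```

Fields that show up at the top of this list are the first candidates for doc values or for removal from sorting and aggregations.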
![Page 18: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/18.jpg)
search
what are the bottlenecks and resource killers?

searching all indices is slow, CPU intensive and causes field data to be loaded for every index
# Searches all indices
/indexname-*/_search
# Search specific indices
/indexname-2015/_search

query_string is flexible but allows inefficient searches like leading wildcard searches and searches _all fields by default
{
  "query_string": {
    "default_field": "_all",
    "allow_leading_wildcard": true,
    "query": "this AND that OR thus"
  }
}
![Page 19: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/19.jpg)
cluster
why is the cluster crashing?

field data used up 70-90% of the heap memory
not much heap left for node & shard management
stop the world Garbage Collector (GC) pauses made the cluster unresponsive
nodes dropped out of the cluster
the G1 GC had longer pauses than the CMS GC
sorting, aggregations, id-cache for parent-child relationships used up a lot of heap memory
managing too many shards used a lot of heap memory
![Page 20: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/20.jpg)
lessons learned...
number of shards / node should not exceed the number of CPU cores
figure out the single shard capacity
monitor field data usage
field data usage is permanent and does not get garbage collected
too high field data usage will bring down the cluster
search specific indices by target date range
tune and test all search API searches
split cluster into data, client and master nodes
use the default ES JVM settings and garbage collector
![Page 21: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/21.jpg)
improvements

hardware
used "large" servers
servers had lots of CPUs & RAM
non-RAIDed spinning disks
put master and client nodes on same servers

cluster
5 -> 8 dedicated nodes
all nodes are master -> dedicated master nodes
all nodes store data -> dedicated data nodes
all nodes sort & aggregate -> dedicated client nodes
![Page 22: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/22.jpg)
improvements

shards
default shard count didn't work
5 -> 1 primary + 1 replica
unlimited primary shards / node -> # of primary shards less than # of CPU cores

indices
data was chronological
used the time-based index strategy
weekly -> monthly indices for transaction logs
daily -> monthly indices for audit logs
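
A rough back-of-the-envelope calculation shows why this change matters for the shard count. The sketch below assumes one index family for transaction logs and one for audit logs, one replica per primary, and a full year of retained data. Those assumptions are for illustration only and are not figures from the talk.

```python
def shards_per_year(indices_per_year: int, primaries: int, replicas: int) -> int:
    """Total shard copies created in a year: every primary plus its replicas."""
    return indices_per_year * primaries * (1 + replicas)


# Before: weekly transaction-log indices and daily audit-log indices,
# each with 5 primaries and 1 replica per primary.
before = shards_per_year(52, 5, 1) + shards_per_year(365, 5, 1)

# After: monthly indices for both log types, 1 primary and 1 replica each.
after = shards_per_year(12, 1, 1) + shards_per_year(12, 1, 1)

print(before, after)  # 4170 48
```

Under these assumptions the original layout produces thousands of shards per year, which is consistent with the "shard count was in the thousands" symptom, while the revised layout stays in the tens.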
![Page 23: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/23.jpg)
improvements

memory
dedicated 31 GB to the jvm heap
used remaining memory for file system cache
turned off linux process swapping
maxed out linux file descriptors
used new G1 GC -> used stable CMS GC
![Page 24: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/24.jpg)
improvements

index mappings
indexed all -> 40 fields
stored big documents with 60+ fields
nested documents
parent-child relationships
used field aliases to define alternate fields used in sorting and aggregation
used doc_value on sortable & aggregation fields
changed boolean data type to string

"field": {
  "index": "no"
}

# uses field data
"fieldA": {
  "type": "boolean"
}

# uses doc_value (no field data)
"fieldA": {
  "type": "string",
  "index": "analyzed",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed",
      "fielddata": { "format": "doc_values" }
    }
  }
}
![Page 25: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/25.jpg)
improvements

searches
search all indices -> target specific indices
query_string -> simple_query_string
search on all -> some fields
sorting & aggregations on all -> low cardinality fields
range queries -> filters
parent-child -> nested queries
added query timeouts

GET /index-201501/_search
"simple_query_string": {
  "query": "+(eggplant | potato)",
  "fields": ["field1", "field2"],
  "default_operator": "and"
}
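
Targeting specific indices, restricting fields and adding a timeout can be combined in one small request builder. The sketch below assumes monthly indices named `<prefix>YYYYMM` (as in `index-201501`) and uses the top-level `timeout` search parameter. The helper names `monthly_indices` and `build_search` and the `2s` default are assumptions made for this example.

```python
from datetime import date
from typing import Dict, List, Tuple


def monthly_indices(prefix: str, start: date, end: date) -> List[str]:
    """Names of the monthly indices (prefix + YYYYMM) covering start..end."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


def build_search(prefix: str, start: date, end: date, text: str,
                 fields: List[str], timeout: str = "2s") -> Tuple[str, Dict]:
    """Return (path, body) for a search limited to the relevant indices."""
    # Comma-separated index names keep the search off unrelated indices.
    path = "/" + ",".join(monthly_indices(prefix, start, end)) + "/_search"
    body = {
        "timeout": timeout,  # assumed value; tune per use case
        "query": {
            "simple_query_string": {
                "query": text,
                "fields": fields,
                "default_operator": "and",
            }
        },
    }
    return path, body


path, body = build_search("index-", date(2014, 11, 15), date(2015, 1, 3),
                          "+(eggplant | potato)", ["field1", "field2"])
print(path)
```

Only the indices that overlap the requested date range are touched, so field data is loaded for those indices alone rather than for every index in the cluster.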
![Page 26: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/26.jpg)
Questions?
![Page 27: Elasticsearch from the trenches](https://reader030.vdocuments.mx/reader030/viewer/2022032620/55cad46ebb61ebb9438b45af/html5/thumbnails/27.jpg)
Thank You