Real-time analytics with Druid at Appsflyer
TRANSCRIPT
![Page 1: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/1.jpg)
Meet Druid!
Real-time analytics with Druid at Appsflyer
![Page 2: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/2.jpg)
Publisher
Click
Install
Appsflyer Flow
Advertiser
![Page 3: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/3.jpg)
Appsflyer as Marketing Platform
Fraud detection
Statistics
Attribution
Life time value
Retargeting
Prediction
A/B testing
![Page 4: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/4.jpg)
Appsflyer Technology
● ~8B events / day
● Hundreds of machines in Amazon
● Tens of micro-services
[Diagram labels: Apache Kafka; service (x6); DB; Amazon S3; MongoDB; Redshift; Druid]
![Page 5: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/5.jpg)
Realtime
● New buzzword
● Ingestion latency - seconds
● Query latency - seconds
![Page 6: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/6.jpg)
Analytics
● Roll-up
  ○ Summarizing over a dimension
● Drill-down
  ○ Focusing (zooming in)
● Slicing and dicing
  ○ Reducing dimensions (slice)
  ○ Picking values of specific dimensions (dice)
● Pivoting
  ○ Rotating the multi-dimensional cube
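The operations on this slide can be sketched on a tiny in-memory data set. This is an illustration only: the events, the dimension names, and the helpers `roll_up` and `dice` are hypothetical and have nothing to do with Druid's own API.

```python
from collections import defaultdict

# Hypothetical events: (country, media_source, day, installs)
EVENTS = [
    ("US", "facebook", "2016-01-01", 10),
    ("US", "google",   "2016-01-01", 5),
    ("RU", "facebook", "2016-01-01", 7),
    ("US", "facebook", "2016-01-02", 3),
    ("RU", "google",   "2016-01-02", 4),
]
DIMS = {"country": 0, "media_source": 1, "day": 2}


def roll_up(events, keep):
    """Sum installs per combination of the dimensions in `keep`.

    Every dimension that is not listed is summarized away.
    """
    totals = defaultdict(int)
    for row in events:
        key = tuple(row[DIMS[d]] for d in keep)
        totals[key] += row[3]
    return dict(totals)


def dice(events, **wanted):
    """Keep only the rows whose dimensions have the requested values."""
    return [r for r in events
            if all(r[DIMS[d]] == v for d, v in wanted.items())]


# Roll-up: summarize everything except country.
by_country = roll_up(EVENTS, ["country"])
# Dice on country, then drill down into media_source.
us_by_source = roll_up(dice(EVENTS, country="US"), ["media_source"])
# Pivot: the same cube viewed with day first, then country.
by_day_country = roll_up(EVENTS, ["day", "country"])

print(by_country)    # {('US',): 18, ('RU',): 11}
print(us_by_source)  # {('facebook',): 13, ('google',): 5}
```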
![Page 7: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/7.jpg)
Analytics in 3D
![Page 8: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/8.jpg)
We tried...
● MongoDB
  ○ Operational issues
  ○ Performance is not great
● Redshift
  ○ Concurrency limits
● Aurora (MySQL)
  ○ Aggregations are not optimized
● MemSQL
  ○ Insufficient performance
  ○ Too pricey
● Cassandra
  ○ Not flexible
![Page 9: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/9.jpg)
Druid
● Storage optimized for analytics
● Lambda architecture inside
● JSON-based query language
● Developed by an analytics SaaS company
● Free and open source
● Scalable to petabytes...
![Page 10: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/10.jpg)
Druid Storage
● Columnar
● Inverted index
● Immutable segments
![Page 11: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/11.jpg)
Columnar Storage
Original data: 100MB
Queried columns: 10MB
Compressed: 3MB
![Page 12: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/12.jpg)
Index
● Values are dictionary encoded
  {“USA” -> 1, “Canada” -> 2, “Mexico” -> 3, …}
● Bitmap for every dimension value (used by filters)
  “USA” -> [0, 1, 0, 0, 1, 1, 0, 0, 0]
● Column values (used by aggregation queries)
  [2, 1, 3, 15, 1, 1, 2, 8, 7]
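A minimal Python sketch of the three structures on this slide, assuming a single string dimension and a single numeric metric column. The `country` column is made up so that the “USA” bitmap matches the slide; the helper names are hypothetical, codes are assigned in first-seen order (so they differ from the slide's dictionary), and real Druid uses compressed bitmaps rather than plain lists.

```python
def build_index(values):
    """Dictionary-encode a column and build one bitmap per distinct value."""
    dictionary = {}   # value -> small integer code
    encoded = []      # the column rewritten as codes
    bitmaps = {}      # value -> list of 0/1 flags, one per row
    for row, value in enumerate(values):
        code = dictionary.setdefault(value, len(dictionary) + 1)
        encoded.append(code)
        bitmaps.setdefault(value, [0] * len(values))[row] = 1
    return dictionary, encoded, bitmaps


def sum_where(bitmap, column):
    """Aggregate a metric column over the rows selected by a bitmap."""
    return sum(v for bit, v in zip(bitmap, column) if bit)


# Hypothetical dimension column and the metric column from the slide.
country = ["Canada", "USA", "Mexico", "Canada", "USA",
           "USA", "Mexico", "Canada", "Mexico"]
metric = [2, 1, 3, 15, 1, 1, 2, 8, 7]

dictionary, encoded, bitmaps = build_index(country)
print(bitmaps["USA"])                     # [0, 1, 0, 0, 1, 1, 0, 0, 0]
print(sum_where(bitmaps["USA"], metric))  # 3
```

A filter never touches the string values: it picks the bitmap for the requested value, and the aggregation reads only the selected rows of the metric column.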
![Page 13: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/13.jpg)
Data Segments
● Per time interval
  ○ Skip segments when querying
● Immutable
  ○ Cache friendly
  ○ No locking
● Versioned (MVCC)
  ○ No locking
  ○ Read-write concurrency
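The effect of per-interval, versioned segments on a query can be sketched as follows. This is a simplified model, not Druid's implementation: the `Segment` type, the integer versions, and the rule that a newer version replaces an older one only when both cover the identical interval are assumptions made for the example.

```python
from datetime import datetime
from typing import NamedTuple


class Segment(NamedTuple):
    start: datetime   # inclusive
    end: datetime     # exclusive
    version: int      # hypothetical integer version; higher is newer
    name: str


def visible_segments(segments, q_start, q_end):
    """Return the newest segment of every interval that overlaps the query."""
    newest = {}
    for seg in segments:
        # Segments outside the queried interval are skipped entirely.
        if seg.start < q_end and q_start < seg.end:
            key = (seg.start, seg.end)
            if key not in newest or seg.version > newest[key].version:
                newest[key] = seg
    return sorted(newest.values(), key=lambda s: s.start)


SEGMENTS = [
    Segment(datetime(2015, 12, 1), datetime(2015, 12, 2), 1, "dec01-v1"),
    Segment(datetime(2015, 12, 1), datetime(2015, 12, 2), 2, "dec01-v2"),
    Segment(datetime(2015, 12, 2), datetime(2015, 12, 3), 1, "dec02-v1"),
    Segment(datetime(2015, 12, 3), datetime(2015, 12, 4), 1, "dec03-v1"),
]

hits = visible_segments(SEGMENTS,
                        datetime(2015, 12, 1, 12), datetime(2015, 12, 2, 12))
names = [s.name for s in hits]
print(names)  # ['dec01-v2', 'dec02-v1']
```

Because segments are immutable, re-indexing an interval writes a new version next to the old one, so readers keep using the old version until the new one is complete and no lock is needed.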
![Page 14: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/14.jpg)
Data Ingestion
[Diagram labels: Real-time Data; Historical Data; Streaming; Hand-off; Batch indexing; Broker; Query]
![Page 15: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/15.jpg)
Real-time Ingestion
● Via Real-Time Node and Firehose
  ○ No redundancy or HA, thus not recommended
● Via Indexing Service and Tranquility API
  ○ Core API
  ○ Integrations with Streaming Frameworks
  ○ HTTP Server
  ○ Kafka Consumer
![Page 16: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/16.jpg)
Batch Ingestion
● File based (HDFS, S3, …)
● Indexers
  ○ Internal Indexer
    ■ For datasets < 1 GB
  ○ External Hadoop Cluster
  ○ Spark Indexer
    ■ Work in progress
![Page 17: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/17.jpg)
Ingestion Spec
● Parsing configuration (Flat JSON, *SV)
● Dimensions
● Metrics
● Granularity
  ○ Segment granularity
  ○ Query granularity
● I/O configuration
  ○ Where to read data from
● Tuning configuration
  ○ Indexer tuning
● Partitioning and replication
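To make the parts of the spec concrete, here is a rough sketch of how they might fit together, written as a Python dictionary that serializes to JSON. The field layout follows my recollection of Druid 0.8-era ingestion specs and should be checked against the documentation for the version in use. The data source, dimension, and metric names are borrowed from the sample query later in the deck; the timestamp column name and the granularities are assumptions, and the I/O and tuning sections are left empty because their contents depend on the indexer.

```python
import json

ingestion_spec = {
    "dataSchema": {
        "dataSource": "inappevents",
        # Parsing configuration: flat JSON rows with a timestamp column.
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                # "timestamp" is a hypothetical column name.
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                "dimensionsSpec": {
                    "dimensions": ["app_id", "country",
                                   "media_source", "campaign"],
                },
            },
        },
        # Metrics: aggregated at ingestion time (roll-up).
        "metricsSpec": [
            {"type": "count", "name": "events_count"},
            {"type": "doubleSum", "name": "revenue", "fieldName": "monetary"},
        ],
        # Granularity: how segments are cut and how finely rows are rolled up.
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",   # assumed value
            "queryGranularity": "MINUTE",   # assumed value
        },
    },
    "ioConfig": {},      # where to read data from; indexer specific, omitted
    "tuningConfig": {},  # indexer tuning; indexer specific, omitted
}

spec_json = json.dumps(ingestion_spec, indent=2)
```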
![Page 18: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/18.jpg)
Real-time ingestion
[Diagram: Task 1 and Task 2 along a Time axis, with Interval and Window marked]
Minimum indexing slots = Data sources x Partitions x Replicas x 2
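The slot formula is plain arithmetic. The factor of 2 presumably reflects the overlap in the diagram: the task for the previous interval is still waiting out its window and handing off while the task for the next interval is already running. A small helper, with a hypothetical name and made-up input numbers:

```python
def min_indexing_slots(data_sources, partitions, replicas):
    """Minimum worker slots per the slide's formula.

    The trailing factor of 2 covers two overlapping tasks per
    (data source, partition, replica) around each interval boundary.
    """
    return data_sources * partitions * replicas * 2


# Example: 3 data sources, 2 partitions each, 2 replicas.
slots = min_indexing_slots(data_sources=3, partitions=2, replicas=2)
print(slots)  # 24
```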
![Page 19: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/19.jpg)
Query Types
● Group by
  ○ Grouping by multiple dimensions
● Top N
  ○ Like grouping by a single dimension
● Timeseries
  ○ w/o grouping over dimensions
● Search
  ○ Dimensions lookup
● Time boundary
  ○ Find available data timeframe
● Metadata queries
![Page 20: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/20.jpg)
Tips for Querying
● Prefer topN over groupBy
● Prefer timeseries over topN and groupBy
● Use limits (and priorities)
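As a sketch of what the cheaper query types look like, the helpers below build a timeseries query and a topN query as Python dictionaries ready to be serialized to JSON. The helper names are hypothetical; the data source, interval, and aggregation mirror the sample query in this deck, and the field names follow the native Druid query format as I remember it, so verify them against the documentation for your version.

```python
INTERVAL = "2015-12-01T00:00:00.000/2016-01-01T00:00:00.000"
AGGREGATIONS = [{"type": "count", "name": "events_count"}]


def timeseries_query(data_source, granularity="hour"):
    """No grouping over dimensions: one result row per time bucket."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "aggregations": AGGREGATIONS,
        "intervals": [INTERVAL],
    }


def top_n_query(data_source, dimension, metric, threshold, granularity="all"):
    """Grouping by a single dimension, keeping the top `threshold` values."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "dimension": dimension,
        "metric": metric,          # aggregation to rank by
        "threshold": threshold,    # number of values to return
        "granularity": granularity,
        "aggregations": AGGREGATIONS,
        "intervals": [INTERVAL],
    }


ts = timeseries_query("inappevents")
top = top_n_query("inappevents", "media_source", "events_count", threshold=10)
```

If a report needs only the ten largest media sources, the topN form states that directly instead of grouping everything and sorting afterwards.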
![Page 21: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/21.jpg)
Query Spec
● Data source
● Dimensions
● Interval
● Filters
● Aggregations
● Post aggregations
● Granularity
● Context (query configuration)
● Limit
![Page 22: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/22.jpg)
Sample Query
~# curl -X POST [email protected] -H "Content-Type: application/json" http://druidbroker:8082/druid/v2?pretty
{
  "queryType": "groupBy",
  "dataSource": "inappevents",
  "granularity": "hour",
  "dimensions": ["media_source", "campaign"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "app_id", "value": "com.comuto" },
      { "type": "selector", "dimension": "country", "value": "RU" }
    ]
  },
  "aggregations": [
    { "type": "count", "name": "events_count" },
    { "type": "doubleSum", "name": "revenue", "fieldName": "monetary" }
  ],
  "intervals": [ "2015-12-01T00:00:00.000/2016-01-01T00:00:00.000" ]
}
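The same request can be sent from Python using only the standard library. This is a sketch: the broker URL is the one shown on the slide, `build_request` and `run_query` are hypothetical helper names, and the small time-boundary query is only there to exercise the request builder.

```python
import json
import urllib.request

BROKER_URL = "http://druidbroker:8082/druid/v2?pretty"  # host as on the slide


def build_request(query, url=BROKER_URL):
    """Wrap a query dictionary in an HTTP POST with a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=BROKER_URL):
    """Send the query to the broker and decode the JSON response."""
    with urllib.request.urlopen(build_request(query, url)) as response:
        return json.load(response)


# Building the request does not need a running broker.
request = build_request({"queryType": "timeBoundary",
                         "dataSource": "inappevents"})
print(request.get_method(), request.full_url)
```

Calling `run_query` with the groupBy query from the slide would need a reachable broker, so it is left out of the example run.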
![Page 23: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/23.jpg)
Caching
● Historical node level
  ○ By segment
● Broker level
  ○ By segment and query
  ○ “groupBy” is disabled on purpose!
● By default - local caching
● In production - use memcached
![Page 24: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/24.jpg)
Load Rules
● Can be defined
  ○ On data source
  ○ On “tier”
● What can be set
  ○ Replication factor
  ○ Load period
  ○ Drop period
● Can be used to separate “hot” data from “cold” data
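A toy model of how period-based rules could split hot data from cold data. The rule shapes (`loadByPeriod`, `dropForever`, `tieredReplicants`) follow Druid's coordinator rule format as I recall it; the tier names, the periods, the day-based approximation of ISO periods, and the `placement` function are assumptions made for the example. Rules are checked in order and the first match wins.

```python
# Crude day counts for the two ISO-8601 periods used below (assumption).
PERIOD_DAYS = {"P1M": 30, "P1Y": 365}

RULES = [
    # Last month: two replicas on a hypothetical "hot" tier.
    {"type": "loadByPeriod", "period": "P1M", "tieredReplicants": {"hot": 2}},
    # Last year: one replica on a hypothetical "cold" tier.
    {"type": "loadByPeriod", "period": "P1Y", "tieredReplicants": {"cold": 1}},
    # Anything older is dropped from the historical nodes.
    {"type": "dropForever"},
]


def placement(segment_age_days, rules=RULES):
    """Return {tier: replicas} for a segment; an empty dict means 'not loaded'."""
    for rule in rules:
        if rule["type"] == "dropForever":
            return {}
        if (rule["type"] == "loadByPeriod"
                and segment_age_days <= PERIOD_DAYS[rule["period"]]):
            return rule["tieredReplicants"]
    return {}


print(placement(10))   # {'hot': 2}
print(placement(100))  # {'cold': 1}
print(placement(400))  # {}
```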
![Page 25: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/25.jpg)
Druid Components
[Diagram labels: Historical Nodes; Real-time Nodes; Coordinator; Indexing Service (Middle Manager, Overlord); Broker Nodes; Deep Storage; Metadata Storage]
![Page 26: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/26.jpg)
Druid Components
[Diagram labels: Historical Nodes; Real-time Nodes; Coordinator; Indexing Service (Middle Manager, Overlord); Broker Nodes; Deep Storage; Metadata Storage; Cache; Load Balancer]
![Page 27: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/27.jpg)
Druid Components (Explained)
● Coordinator
  ○ Manages segments
● Real-time Nodes
  ○ Pull data in real time and index it
● Historical Nodes
  ○ Keep historical segments
● Overlord
  ○ Accepts tasks and distributes them to Middle Managers
● Middle Manager
  ○ Executes submitted tasks via Peons
● Broker Nodes
  ○ Route queries to Real-time and Historical nodes and merge the results
● Deep Storage
  ○ Segments backup (HDFS, S3, …)
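The broker's role can be sketched as a scatter-gather merge. In this hypothetical example each node returns partial counts per dimension value for the segments it holds, and the merge step adds them up; this works for additive aggregations such as counts and sums. The function name and the numbers are made up.

```python
from collections import Counter


def merge_partials(partials):
    """Combine per-node partial results into a single answer."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds counts for keys that already exist
    return dict(merged)


# Partial results for the same query from three hypothetical nodes.
realtime_node = {"facebook": 4, "google": 1}
historical_a = {"facebook": 10, "twitter": 2}
historical_b = {"google": 6}

result = merge_partials([realtime_node, historical_a, historical_b])
print(result)  # {'facebook': 14, 'google': 7, 'twitter': 2}
```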
![Page 28: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/28.jpg)
Failover
● Coordinator and Overlord
  ○ HA
● Real-time nodes
  ○ Tasks are replicated
  ○ Pool of nodes
● Historical nodes
  ○ Data is replicated
  ○ Pool of nodes
  ○ All segments are backed up in the deep storage
● Brokers
  ○ Pool of nodes
  ○ Load balancer at the front
![Page 29: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/29.jpg)
Druid at Appsflyer
[Diagram labels: Druid Sink; S3; S3]
![Page 30: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/30.jpg)
Druid Sink
[Diagram labels: Druid Sink; Tranquility API]
Probably not needed anymore due to native support in the Tranquility package
![Page 31: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/31.jpg)
Druid in Production
● Provisioning using Chef
● r3.8xlarge (sample configuration is OK)
● Redundancy for coordinator and overlord (node per AZ)
● Historical and real-time nodes are spread across AZs
● LB - Consul from HashiCorp
● Service discovery - Consul again
● Memcached
● Monitoring via Graphite Emitter extension
  ○ https://github.com/druid-io/druid/pull/1978
● Alerting via Sensu
![Page 32: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/32.jpg)
IAP Distribution
● 3 different node types (instead of 6)
● Unpack and run
● Some useful wrappers
● Built-in examples for quick start
● Commercial support
● PlyQL, Pivot inside
http://imply.io
![Page 33: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/33.jpg)
Tips
● ZooKeeper is heavily used
  ○ Choose appropriate hardware/network for ZK machines
● Use the latest version (0.8.3)
  ○ Restartable tasks
  ○ Indexing time improvement! (https://github.com/druid-io/druid/pull/1960)
  ○ Data sketches library
● All exceptions are useful
![Page 34: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/34.jpg)
When Not to Choose Druid?
● When data is not time-series
● When data cardinality is high
● When number of output rows is high
● When setup costs must be avoided
![Page 35: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/35.jpg)
Non-time Series Workarounds
● Must have some timestamp still
● Rebuild everything to order by your timestamp
● Or, use single-dimension partitioning
  ○ Segments partitioned by timestamp first, then by dimension range
  ○ Find optimal target segment size
Still, please don’t use Druid for non-time series!
![Page 36: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/36.jpg)
Tools: Pivot
![Page 37: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/37.jpg)
Tools: Panoramix
![Page 38: Real-time analytics with Druid at Appsflyer](https://reader034.vdocuments.mx/reader034/viewer/2022042619/587069511a28ab48378b5b81/html5/thumbnails/38.jpg)
Thank you!