real-time analytics with druid at appsflyer

Meet Druid! Real-time analytics with Druid at Appsflyer

Appsflyer Flow

[Diagram labels: Publisher, Click, Install, Advertiser]

Appsflyer as Marketing Platform

● Fraud detection
● Statistics
● Attribution
● Lifetime value
● Retargeting
● Prediction
● A/B testing

Appsflyer Technology

● ~8B events / day
● Hundreds of machines in Amazon
● Tens of micro-services

[Architecture diagram: multiple services connected through Apache Kafka; databases shown: Amazon S3, MongoDB, Redshift, Druid]

Realtime

● New buzzword
● Ingestion latency - seconds
● Query latency - seconds

Analytics

● Roll-up
  ○ Summarizing over a dimension
● Drill-down
  ○ Focusing (zooming in)
● Slicing and dicing
  ○ Reducing dimensions (slice)
  ○ Picking values of specific dimensions (dice)
● Pivoting
  ○ Rotating multi-dimensional cube
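The operations above can be illustrated on a tiny in-memory data set. This is only a sketch in plain Python, not how Druid executes them; the rows, the column layout, and the helper names roll_up and dice are made up for illustration.

```python
from collections import defaultdict

# Hypothetical event rows: (country, media_source, day, installs)
rows = [
    ("US", "facebook", "2016-01-01", 10),
    ("US", "google", "2016-01-01", 5),
    ("RU", "facebook", "2016-01-01", 7),
    ("US", "facebook", "2016-01-02", 3),
]

def roll_up(rows, keep):
    """Sum installs over every dimension whose column index is not in `keep`."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in keep)
        totals[key] += row[3]
    return dict(totals)

def dice(rows, country):
    """Keep only the rows with a specific value of the country dimension."""
    return [r for r in rows if r[0] == country]

# Roll-up: summarize everything down to the country dimension.
by_country = roll_up(rows, keep=[0])

# Dice on country = "US", then drill down by media source.
us_by_source = roll_up(dice(rows, "US"), keep=[1])

print(by_country)
print(us_by_source)
```

Keeping fewer column indexes in `keep` corresponds to reducing dimensions (slice), and adding one back corresponds to drilling down.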

Analytics in 3D

We tried...

● MongoDB
  ○ Operational issues
  ○ Performance is not great
● Redshift
  ○ Concurrency limits
● Aurora (MySQL)
  ○ Aggregations are not optimized
● MemSQL
  ○ Insufficient performance
  ○ Too pricey
● Cassandra
  ○ Not flexible

Druid

● Storage optimized for analytics
● Lambda architecture inside
● JSON-based query language
● Developed by analytics SaaS company
● Free and open source
● Scalable to petabytes...

Druid Storage

● Columnar
● Inverted index
● Immutable segments

Columnar Storage

Original data: 100MB

Queried columns: 10MB

Compressed: 3MB

Index

● Values are dictionary encoded
  {“USA” -> 1, “Canada” -> 2, “Mexico” -> 3, …}
● Bitmap for every dimension value (used by filters)
  “USA” -> [0, 1, 0, 0, 1, 1, 0, 0, 0]
● Column values (used by aggregation queries)
  [2, 1, 3, 15, 1, 1, 2, 8, 7]
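The index example above can be reproduced in a few lines. This is a minimal sketch using plain Python lists; Druid itself stores compressed bitmaps, and the helper names and the revenue column here are hypothetical.

```python
# Dictionary and encoded column taken from the slide.
dictionary = {"USA": 1, "Canada": 2, "Mexico": 3}
column = [2, 1, 3, 15, 1, 1, 2, 8, 7]

# Hypothetical metric column, one value per row.
revenue = [1.0, 2.5, 0.0, 4.0, 1.5, 3.0, 0.5, 2.0, 1.0]

def bitmap_for(value_id, column):
    """One bit per row: 1 where the row holds the given dictionary id."""
    return [1 if v == value_id else 0 for v in column]

def bitmap_or(a, b):
    """Combine two filters with OR by combining their bitmaps."""
    return [x | y for x, y in zip(a, b)]

def matching_rows(bitmap):
    """Row numbers selected by a bitmap."""
    return [i for i, bit in enumerate(bitmap) if bit]

usa = bitmap_for(dictionary["USA"], column)
canada = bitmap_for(dictionary["Canada"], column)

# Filter: country = "USA" OR country = "Canada"
selected = matching_rows(bitmap_or(usa, canada))

# Aggregation only touches the rows the filter selected.
usa_revenue = sum(revenue[i] for i in matching_rows(usa))

print(usa)       # [0, 1, 0, 0, 1, 1, 0, 0, 0]
print(selected)  # [0, 1, 4, 5, 6]
```

The filter is resolved entirely on bitmaps, so the metric column is read only for the matching rows.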

Data Segments

● Per time interval
  ○ Skip segments when querying
● Immutable
  ○ Cache friendly
  ○ No locking
● Versioned (MVCC)
  ○ No locking
  ○ Read-write concurrency
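How time-based segments and versioning help at query time can be sketched as follows. This is a simplified model, assuming segments are identified by a half-open time interval plus a version string and that the highest version of an interval wins; the function name and the sample segments are hypothetical, and Druid's real version handling covers more cases.

```python
from datetime import datetime

def hour(h):
    return datetime(2016, 1, 1, h)

# Hypothetical hourly segments: (start, end, version).
# Hour 01:00-02:00 was re-indexed, producing a newer immutable segment "v2".
segments = [
    (hour(0), hour(1), "v1"),
    (hour(1), hour(2), "v1"),
    (hour(1), hour(2), "v2"),
    (hour(2), hour(3), "v1"),
    (hour(3), hour(4), "v1"),
]

def visible_segments(segments, q_start, q_end):
    """Return the newest version of every segment overlapping the query interval."""
    latest = {}
    for start, end, version in segments:
        # Half-open interval overlap test; non-overlapping segments are skipped.
        if start < q_end and q_start < end:
            key = (start, end)
            if key not in latest or version > latest[key]:
                latest[key] = version
    return sorted((s, e, v) for (s, e), v in latest.items())

result = visible_segments(segments, hour(1), hour(3))
for start, end, version in result:
    print(start.isoformat(), end.isoformat(), version)
```

Because a re-indexed interval is written as a new segment rather than modified in place, readers never need a lock: they keep using the old version until the new one is visible.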

Data Ingestion

[Diagram: Streaming feeds Real-time Data, which is moved to Historical Data by Hand-off; Batch indexing feeds Historical Data; the Broker sends each Query to both]

Real-time Ingestion

● Via Real-Time Node and Firehose
  ○ No redundancy or HA, thus not recommended
● Via Indexing Service and Tranquility API
  ○ Core API
  ○ Integrations with Streaming Frameworks
  ○ HTTP Server
  ○ Kafka Consumer

Batch Ingestion

● File based (HDFS, S3, …)
● Indexers
  ○ Internal Indexer
    ■ For datasets < 1 GB
  ○ External Hadoop Cluster
  ○ Spark Indexer
    ■ Work in progress

Ingestion Spec

● Parsing configuration (Flat JSON, *SV)
● Dimensions
● Metrics
● Granularity
  ○ Segment granularity
  ○ Query granularity
● I/O configuration
  ○ Where to read data from
● Tuning configuration
  ○ Indexer tuning
● Partitioning and replication
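To make the parts of a spec concrete, here is a rough sketch of how they might fit together, written as a Python dictionary and serialized to JSON. The data source, dimension, and metric names are borrowed from the sample query later in this deck; the timestamp column name, the file location, and the tuning value are assumptions, and the exact field names depend on the task type and Druid version, so treat this as an outline rather than a spec to submit as-is.

```python
import json

ingestion_spec = {
    "dataSchema": {
        "dataSource": "inappevents",
        # Parsing configuration: flat JSON records with a timestamp column.
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                # "timestamp" is an assumed column name.
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                "dimensionsSpec": {
                    "dimensions": ["app_id", "country", "media_source", "campaign"]
                },
            },
        },
        # Metrics computed at ingestion time.
        "metricsSpec": [
            {"type": "count", "name": "events_count"},
            {"type": "doubleSum", "name": "revenue", "fieldName": "monetary"},
        ],
        # Segment granularity: time span covered by one segment.
        # Query granularity: finest time bucket kept inside a segment.
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",
            "queryGranularity": "MINUTE",
        },
    },
    # I/O configuration: where to read data from (hypothetical local directory).
    "ioConfig": {
        "type": "index",
        "firehose": {"type": "local", "baseDir": "/tmp/events", "filter": "*.json"},
    },
    # Tuning configuration: indexer tuning (illustrative value only).
    "tuningConfig": {"type": "index", "targetPartitionSize": 5000000},
}

spec_json = json.dumps(ingestion_spec, indent=2)
print(spec_json)
```

With these two granularities, events would be rolled up into one-minute buckets and stored in one segment per hour.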

Real-time ingestion

[Diagram: Time axis showing Task 1 and Task 2, each labeled with its Interval and Window]

Minimum indexing slots = Data sources x Partitions x Replicas x 2
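The slot formula is easy to turn into a capacity check. The sketch below assumes every data source uses the same number of partitions and replicas; the function name and the example numbers are made up. The factor of 2 is read here as covering the overlap between consecutive tasks, which is an interpretation of the diagram rather than something stated on the slide.

```python
def minimum_indexing_slots(data_sources, partitions, replicas):
    """Minimum indexing slots = data sources x partitions x replicas x 2."""
    if min(data_sources, partitions, replicas) < 1:
        raise ValueError("all arguments must be positive integers")
    return data_sources * partitions * replicas * 2

# Example: 3 data sources, each with 2 partitions and 2 replicas.
slots = minimum_indexing_slots(data_sources=3, partitions=2, replicas=2)
print(slots)  # 24
```

If fewer slots are available than this number, some real-time tasks have to wait for a free slot.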

Query Types

● Group by
  ○ Grouping by multiple dimensions
● Top N
  ○ Like grouping by a single dimension
● Timeseries
  ○ Without grouping over dimensions
● Search
  ○ Dimensions lookup
● Time boundary
  ○ Find available data timeframe
● Metadata queries

Tips for Querying

● Prefer topN over groupBy
● Prefer timeseries over topN and groupBy
● Use limits (and priorities)
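One way to apply these tips is to pick the cheapest query type from the number of dimensions requested. The following is a sketch only: the helper names, the threshold, the limit, and the priority value are assumptions, and the data source and metric names are borrowed from the sample query in this deck. Keep in mind that topN results are approximate.

```python
def choose_query_type(dimensions):
    """No dimensions -> timeseries, one -> topN, several -> groupBy."""
    if not dimensions:
        return "timeseries"
    if len(dimensions) == 1:
        return "topN"
    return "groupBy"

def build_query(dimensions, priority=0):
    """Build a query body, always with a limit and a priority in the context."""
    query = {
        "queryType": choose_query_type(dimensions),
        "dataSource": "inappevents",
        "granularity": "hour",
        "aggregations": [{"type": "count", "name": "events_count"}],
        "intervals": ["2015-12-01T00:00:00.000/2016-01-01T00:00:00.000"],
        "context": {"priority": priority},
    }
    if query["queryType"] == "topN":
        # topN takes a single dimension, a ranking metric, and a threshold.
        query.update(dimension=dimensions[0], metric="events_count", threshold=10)
    elif query["queryType"] == "groupBy":
        query["dimensions"] = list(dimensions)
        query["limitSpec"] = {"type": "default", "limit": 100, "columns": ["events_count"]}
    return query

print(build_query([])["queryType"])
print(build_query(["media_source"])["queryType"])
print(build_query(["media_source", "campaign"])["queryType"])
```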

Query Spec

● Data source
● Dimensions
● Interval
● Filters
● Aggregations
● Post aggregations
● Granularity
● Context (query configuration)
● Limit

Sample Query

~# curl -X POST -d @<query file> -H "Content-Type: application/json" 'http://druidbroker:8082/druid/v2?pretty'

{
  "queryType": "groupBy",
  "dataSource": "inappevents",
  "granularity": "hour",
  "dimensions": ["media_source", "campaign"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "app_id", "value": "com.comuto" },
      { "type": "selector", "dimension": "country", "value": "RU" }
    ]
  },
  "aggregations": [
    { "type": "count", "name": "events_count" },
    { "type": "doubleSum", "name": "revenue", "fieldName": "monetary" }
  ],
  "intervals": [ "2015-12-01T00:00:00.000/2016-01-01T00:00:00.000" ]
}

http://druidquery-20001-001-prod.eu1.appsflyer.com:8082/druid/v2?pretty
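The same request can be sent from a script instead of curl. This is a minimal sketch using only the Python standard library; the broker URL is the one from the curl example, the helper names build_request and run_query are hypothetical, and run_query needs a reachable broker to return anything.

```python
import json
import urllib.request

BROKER_URL = "http://druidbroker:8082/druid/v2?pretty"

# The groupBy query from the slide, as a Python dictionary.
query = {
    "queryType": "groupBy",
    "dataSource": "inappevents",
    "granularity": "hour",
    "dimensions": ["media_source", "campaign"],
    "filter": {
        "type": "and",
        "fields": [
            {"type": "selector", "dimension": "app_id", "value": "com.comuto"},
            {"type": "selector", "dimension": "country", "value": "RU"},
        ],
    },
    "aggregations": [
        {"type": "count", "name": "events_count"},
        {"type": "doubleSum", "name": "revenue", "fieldName": "monetary"},
    ],
    "intervals": ["2015-12-01T00:00:00.000/2016-01-01T00:00:00.000"],
}

def build_request(query, url=BROKER_URL):
    """Serialize the query to JSON and wrap it in an HTTP POST request."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_query(query, url=BROKER_URL, timeout=30):
    """Send the query to the broker and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(query, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

request = build_request(query)
print(request.get_method(), request.full_url)
```

Calling run_query(query) against a live broker would return the parsed JSON result.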

Caching

● Historical node level
  ○ By segment
● Broker level
  ○ By segment and query
  ○ “groupBy” is disabled on purpose!
● By default - local caching
● In production - use memcached

Load Rules

● Can be defined
  ○ On data source
  ○ On “tier”
● What can be set
  ○ Replication factor
  ○ Load period
  ○ Drop period
● Can be used to separate “hot” data from “cold” data
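The hot/cold separation can be pictured as an ordered list of rules where the first matching rule decides what happens to a segment. The sketch below is a simplified model of that idea, not Druid's actual rule format; the tier names, periods, and replication factors are hypothetical.

```python
from datetime import datetime, timedelta

# Each rule: (action, maximum segment age, tier, replication factor).
# Rules are checked in order and the first match wins.
RULES = [
    ("load", timedelta(days=30), "hot", 2),    # recent data: fast tier, 2 replicas
    ("load", timedelta(days=365), "cold", 1),  # older data: cheaper tier, 1 replica
    ("drop", None, None, 0),                   # everything else is dropped
]

def apply_rules(segment_end, now, rules=RULES):
    """Return (action, tier, replicas) for a segment ending at `segment_end`."""
    age = now - segment_end
    for action, max_age, tier, replicas in rules:
        if max_age is None or age <= max_age:
            return action, tier, replicas
    return "drop", None, 0

now = datetime(2016, 3, 1)
print(apply_rules(datetime(2016, 2, 20), now))
print(apply_rules(datetime(2015, 9, 1), now))
print(apply_rules(datetime(2014, 1, 1), now))
```

In this model a segment moves from the hot tier to the cold tier simply by aging past the first rule's period.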

Druid Components

[Diagram: Historical Nodes, Real-time Nodes, Coordinator, Indexing Service (Overlord, Middle Manager), Broker Nodes, Deep Storage, Metadata Storage; a second version of the diagram adds Cache and Load Balancer]

Druid Components (Explained)

● Coordinator
  ○ Manages segments
● Real-time Nodes
  ○ Pull data in real time and index it
● Historical Nodes
  ○ Keep historical segments
● Overlord
  ○ Accepts tasks and distributes them to Middle Managers
● Middle Manager
  ○ Executes submitted tasks via Peons
● Broker Nodes
  ○ Route queries to Real-time and Historical nodes, merge results
● Deep Storage
  ○ Segments backup (HDFS, S3, …)

Failover

● Coordinator and Overlord
  ○ HA
● Real-time nodes
  ○ Tasks are replicated
  ○ Pool of nodes
● Historical nodes
  ○ Data is replicated
  ○ Pool of nodes
  ○ All segments are backed up in the deep storage
● Brokers
  ○ Pool of nodes
  ○ Load balancer at the front

Druid at Appsflyer

[Diagram labels: Druid Sink, S3, Tranquility API]

Note on the diagram: probably not needed anymore due to native support in the Tranquility package

Druid in Production

● Provisioning using Chef
● r3.8xlarge (sample configuration is OK)
● Redundancy for coordinator and overlord (node per AZ)
● Historical and real-time nodes are spread across AZs
● LB - Consul from HashiCorp
● Service discovery - Consul again
● Memcached
● Monitoring via Graphite Emitter extension
  ○ https://github.com/druid-io/druid/pull/1978
● Alerting via Sensu


IAP Distribution

● 3 different node types (instead of 6)
● Unpack and run
● Some useful wrappers
● Built-in examples for quick start
● Commercial support
● PlyQL, Pivot inside

http://imply.io

Tips

● ZooKeeper is heavily used
  ○ Choose appropriate hardware/network for ZK machines
● Use latest version (0.8.3)
  ○ Restartable tasks
  ○ Indexing time improvement! (https://github.com/druid-io/druid/pull/1960)
  ○ Data sketches library
● All exceptions are useful


When Not to Choose Druid?

● When data is not time-series
● When data cardinality is high
● When number of output rows is high
● When setup costs must be avoided

Non-time Series Workarounds

● Data must still have some timestamp
● Rebuild everything to order by your timestamp
● Or, use single-dimension partitioning
  ○ Segments partitioned by timestamp first, then by dimension range
  ○ Find optimal target segment size

Still, please don’t use Druid for non-time series!

Tools: Pivot

Tools: Panoramix

Thank you!
