Druid at SF Big Analytics 2015-12-01
![Page 1: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/1.jpg)
DRUID INTERACTIVE EXPLORATORY ANALYTICS AT SCALE
GIAN MERLINO · DRUID COMMITTER · COFOUNDER @ IMPLY
![Page 2: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/2.jpg)
OVERVIEW
‣ MOTIVATION · WHY DRUID?
‣ DEMO · AN EXAMPLE APPLICATION
‣ ARCHITECTURE · HIGH LEVEL OVERVIEW
‣ COMMUNITY · CONTRIBUTE TO DRUID
![Page 3: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/3.jpg)
HISTORY & MOTIVATION
‣ Druid was started in 2011
‣ Power interactive data applications
‣ Multi-tenancy: lots of concurrent users
‣ Scalability: trillions of events/day, sub-second queries
‣ Real-time analysis
![Page 4: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/4.jpg)
HISTORY & MOTIVATION
‣ Questions lead to more questions
‣ Dig into the dataset using filters, aggregates, and comparisons
‣ Not all interesting queries can be determined upfront
![Page 5: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/5.jpg)
DEMO
IN CASE THE INTERNET DIDN’T WORK, PRETEND YOU SAW SOMETHING COOL
![Page 6: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/6.jpg)
A GENERAL SOLUTION?
‣ Load all your data into Hadoop. Query it. Done!
‣ Good job guys, let’s go home
![Page 7: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/7.jpg)
FINDING A SOLUTION
[Diagram: Event Streams → Hadoop → Insight]
![Page 8: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/8.jpg)
FINDING A SOLUTION
[Diagram: Event Streams → Hadoop (pre-processing and storage) → Query Layer → Insight]
![Page 9: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/9.jpg)
POSSIBLE SOLUTIONS
![Page 10: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/10.jpg)
MAKE QUERIES FASTER
‣ Optimizing business intelligence (OLAP) queries
  • Aggregate measures over time, broken down by dimensions
  • Revenue over time broken down by product type
  • Top selling products by volume in San Francisco
  • Number of unique visitors broken down by age
  • Not dumping the entire dataset
  • Not examining individual events
![Page 11: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/11.jpg)
FINDING A SOLUTION
[Diagram: Event Streams → Hadoop (pre-processing and storage) → Sharded RDBMS? → Insight]
![Page 12: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/12.jpg)
GENERAL PURPOSE RDBMS
‣ The idea
  • Row store
  • Star schema
  • Aggregate tables
  • Query cache
‣ But!
  • Scanning raw data is slow and expensive
![Page 13: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/13.jpg)
FINDING A SOLUTION
[Diagram: Event Streams → Hadoop (pre-processing and storage) → NoSQL K/V Stores? → Insight]
![Page 14: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/14.jpg)
KEY/VALUE STORES
‣ Pre-computation
  • Pre-compute every possible query
  • Pre-compute a subset of queries
  • Exponential scaling costs
‣ Range scans
  • Primary key: dimensions/attributes
  • Value: measures/metrics (things to aggregate)
  • Still too slow!
![Page 15: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/15.jpg)
FINDING A SOLUTION
[Diagram: Event Streams → Hadoop (pre-processing and storage) → Column Stores → Insight]
![Page 16: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/16.jpg)
COLUMN STORES
‣ Load/scan exactly what you need for a query
‣ Different compression algorithms for different columns
  • Encoding for string columns
  • Compression for measure columns
‣ Different indexes for different columns
![Page 17: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/17.jpg)
DRUID
![Page 18: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/18.jpg)
DRUID · KEY FEATURES
‣ LOW LATENCY INGESTION
‣ FAST AGGREGATIONS
‣ ARBITRARY SLICE-N-DICE CAPABILITIES
‣ HIGHLY AVAILABLE
‣ APPROXIMATE & EXACT CALCULATIONS
![Page 19: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/19.jpg)
DATA STORAGE
![Page 20: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/20.jpg)
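DATA!

| timestamp | page | language | city | country | ... | added | deleted |
|---|---|---|---|---|---|---|---|
| 2011-01-01T00:01:35Z | Justin Bieber | en | SF | USA | ... | 10 | 65 |
| 2011-01-01T00:01:63Z | Justin Bieber | en | SF | USA | ... | 15 | 62 |
| 2011-01-01T01:02:51Z | Justin Bieber | en | SF | USA | ... | 32 | 45 |
| 2011-01-01T01:01:11Z | Ke$ha | en | Calgary | CA | ... | 17 | 87 |
| 2011-01-01T01:02:24Z | Ke$ha | en | Calgary | CA | ... | 43 | 99 |
| 2011-01-01T02:03:12Z | Ke$ha | en | Calgary | CA | ... | 12 | 53 |
| ... | | | | | | | |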
![Page 21: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/21.jpg)
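PRE-AGGREGATION/ROLL-UP

Raw events:

| timestamp | page | language | city | country | ... | added | deleted |
|---|---|---|---|---|---|---|---|
| 2011-01-01T00:01:35Z | Justin Bieber | en | SF | USA | ... | 10 | 65 |
| 2011-01-01T00:01:63Z | Justin Bieber | en | SF | USA | ... | 15 | 62 |
| 2011-01-01T01:02:51Z | Justin Bieber | en | SF | USA | ... | 32 | 45 |
| 2011-01-01T01:01:11Z | Ke$ha | en | Calgary | CA | ... | 17 | 87 |
| 2011-01-01T01:02:24Z | Ke$ha | en | Calgary | CA | ... | 43 | 99 |
| 2011-01-01T02:03:12Z | Ke$ha | en | Calgary | CA | ... | 12 | 53 |
| ... | | | | | | | |

Rolled up by hour:

| timestamp | page | language | city | country | ... | added | deleted |
|---|---|---|---|---|---|---|---|
| 2011-01-01T00:00:00Z | Justin Bieber | en | SF | USA | ... | 25 | 127 |
| 2011-01-01T01:00:00Z | Justin Bieber | en | SF | USA | ... | 32 | 45 |
| 2011-01-01T01:00:00Z | Ke$ha | en | Calgary | CA | ... | 60 | 186 |
| 2011-01-01T02:00:00Z | Ke$ha | en | Calgary | CA | ... | 12 | 53 |
| ... | | | | | | | |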
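The roll-up on this slide can be reproduced with a few lines of Python. This is only a sketch of the idea, not Druid's ingestion code: the function names are made up for illustration, and the timestamp is truncated by string slicing so that the slide's rows can be used exactly as printed.

```python
from collections import defaultdict

# Raw events copied from the slide:
# (timestamp, page, language, city, country, added, deleted)
RAW_EVENTS = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "en", "SF", "USA", 10, 65),
    ("2011-01-01T00:01:63Z", "Justin Bieber", "en", "SF", "USA", 15, 62),
    ("2011-01-01T01:02:51Z", "Justin Bieber", "en", "SF", "USA", 32, 45),
    ("2011-01-01T01:01:11Z", "Ke$ha", "en", "Calgary", "CA", 17, 87),
    ("2011-01-01T01:02:24Z", "Ke$ha", "en", "Calgary", "CA", 43, 99),
    ("2011-01-01T02:03:12Z", "Ke$ha", "en", "Calgary", "CA", 12, 53),
]


def truncate_to_hour(timestamp):
    """'2011-01-01T00:01:35Z' -> '2011-01-01T00:00:00Z' (hypothetical helper)."""
    return timestamp[:13] + ":00:00Z"


def roll_up(events):
    """Sum the metrics of all events that share an hour and the same dimensions."""
    totals = defaultdict(lambda: [0, 0])
    for timestamp, *dimensions, added, deleted in events:
        key = (truncate_to_hour(timestamp), *dimensions)
        totals[key][0] += added
        totals[key][1] += deleted
    return [(*key, added, deleted) for key, (added, deleted) in sorted(totals.items())]


for row in roll_up(RAW_EVENTS):
    print(row)
```

Six raw rows collapse into four, matching the rolled-up table. The saving grows with the number of events that share an hour and a set of dimension values.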
![Page 22: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/22.jpg)
PARTITION DATA
‣ Shard data by time
‣ Immutable blocks of data called “segments”

| segment | timestamp | page | language | city | country | ... | added | deleted |
|---|---|---|---|---|---|---|---|---|
| Segment 2011-01-01T00/2011-01-01T01 | 2011-01-01T00:00:00Z | Justin Bieber | en | SF | USA | ... | 25 | 127 |
| Segment 2011-01-01T01/2011-01-01T02 | 2011-01-01T01:00:00Z | Justin Bieber | en | SF | USA | ... | 32 | 45 |
| Segment 2011-01-01T01/2011-01-01T02 | 2011-01-01T01:00:00Z | Ke$ha | en | Calgary | CA | ... | 60 | 186 |
| Segment 2011-01-01T02/2011-01-01T03 | 2011-01-01T02:00:00Z | Ke$ha | en | Calgary | CA | ... | 12 | 53 |
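Time-based sharding can be sketched as grouping rows by the hour interval they fall into and freezing each group. The helper names below are hypothetical, and a tuple stands in for an immutable segment; a real Druid segment is a file format, not a Python tuple.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Rolled-up rows from the slide, trimmed to (timestamp, page, added, deleted).
ROLLED_UP = [
    ("2011-01-01T00:00:00Z", "Justin Bieber", 25, 127),
    ("2011-01-01T01:00:00Z", "Justin Bieber", 32, 45),
    ("2011-01-01T01:00:00Z", "Ke$ha", 60, 186),
    ("2011-01-01T02:00:00Z", "Ke$ha", 12, 53),
]


def hourly_interval(timestamp):
    """Return the 'start/end' hour interval containing the timestamp."""
    start = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(minute=0, second=0)
    end = start + timedelta(hours=1)
    return f"{start:%Y-%m-%dT%H}/{end:%Y-%m-%dT%H}"


def build_segments(rows):
    """Group rows by hour interval; each group becomes an immutable tuple."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[hourly_interval(row[0])].append(row)
    return {interval: tuple(group) for interval, group in grouped.items()}


segments = build_segments(ROLLED_UP)
for interval in sorted(segments):
    print("Segment", interval, "->", len(segments[interval]), "row(s)")
```

Because a query names a time range, only the segments whose intervals overlap that range need to be read.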
![Page 23: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/23.jpg)
IMMUTABLE SEGMENTS
‣ Fundamental storage unit in Druid
‣ No contention between reads and writes
‣ One thread scans one segment
‣ Multiple threads can access same underlying data
![Page 24: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/24.jpg)
COLUMNAR STORAGE
‣ Scan/load only what you need
‣ Compression!
‣ Indexes!

| timestamp | page | language | city | country | ... | added | deleted |
|---|---|---|---|---|---|---|---|
| 2011-01-01T00:01:35Z | Justin Bieber | en | SF | USA | ... | 10 | 65 |
| 2011-01-01T00:03:63Z | Justin Bieber | en | SF | USA | ... | 15 | 62 |
| 2011-01-01T00:04:51Z | Justin Bieber | en | SF | USA | ... | 32 | 45 |
| 2011-01-01T01:00:00Z | Ke$ha | en | Calgary | CA | ... | 17 | 87 |
| 2011-01-01T02:00:00Z | Ke$ha | en | Calgary | CA | ... | 43 | 99 |
| 2011-01-01T02:00:00Z | Ke$ha | en | Calgary | CA | ... | 12 | 53 |
| ... | | | | | | | |
![Page 25: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/25.jpg)
COLUMN COMPRESSION · DICTIONARIES
‣ Create ids
  • Justin Bieber -> 0, Ke$ha -> 1
‣ Store
  • page -> [0 0 0 1 1 1]
  • language -> [0 0 0 0 0 0]
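Dictionary encoding as described on this slide fits in a short function. In this sketch ids are handed out in first-seen order, which is an assumption made for simplicity; the slide does not say how Druid orders its dictionary.

```python
def dictionary_encode(values):
    """Replace each string with a small integer id; return (ids, encoded column)."""
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids)  # next unused id
        encoded.append(ids[value])
    return ids, encoded


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
language = ["en"] * 6

page_ids, page_column = dictionary_encode(page)
language_ids, language_column = dictionary_encode(language)

print(page_ids)         # {'Justin Bieber': 0, 'Ke$ha': 1}
print(page_column)      # [0, 0, 0, 1, 1, 1]
print(language_column)  # [0, 0, 0, 0, 0, 0]
```

Each distinct string is stored once, and the column itself becomes a list of small integers that compresses well.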
![Page 26: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/26.jpg)
BITMAP INDICES
‣ Justin Bieber -> [0, 1, 2] -> [111000]
‣ Ke$ha -> [3, 4, 5] -> [000111]
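A bitmap index keeps one bitmap per distinct value, with bit i set when row i holds that value. The sketch below uses plain Python lists as uncompressed bitmaps to keep the idea visible; production systems use compressed bitmap formats, and `build_bitmap_index` is a made-up name.

```python
def build_bitmap_index(column):
    """Map each distinct value to a bitmap marking the rows that contain it."""
    index = {}
    for row, value in enumerate(column):
        bitmap = index.setdefault(value, [0] * len(column))
        bitmap[row] = 1
    return index


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
index = build_bitmap_index(page)

print(index["Justin Bieber"])  # [1, 1, 1, 0, 0, 0]
print(index["Ke$ha"])          # [0, 0, 0, 1, 1, 1]
```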
![Page 27: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/27.jpg)
FAST AND FLEXIBLE QUERIES

| row | page |
|---|---|
| 0 | Justin Bieber |
| 1 | Justin Bieber |
| 2 | Ke$ha |
| 3 | Ke$ha |

‣ JUSTIN BIEBER: [1, 1, 0, 0]
‣ KE$HA: [0, 0, 1, 1]
‣ JUSTIN BIEBER OR KE$HA: [1, 1, 1, 1]
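The filter on this slide is a bitwise OR of two bitmaps. A minimal sketch with list-based bitmaps follows; the function names are illustrative, and an AND is included only to show that other boolean filters combine the same way.

```python
# Per-value bitmaps for the four-row example on the slide.
BITMAPS = {
    "Justin Bieber": [1, 1, 0, 0],
    "Ke$ha": [0, 0, 1, 1],
}


def bitmap_or(left, right):
    """Rows that match either filter."""
    return [a | b for a, b in zip(left, right)]


def bitmap_and(left, right):
    """Rows that match both filters."""
    return [a & b for a, b in zip(left, right)]


def matching_rows(bitmap):
    """Row numbers whose bit is set; only these rows need to be scanned."""
    return [row for row, bit in enumerate(bitmap) if bit]


either = bitmap_or(BITMAPS["Justin Bieber"], BITMAPS["Ke$ha"])
print(either)                 # [1, 1, 1, 1]
print(matching_rows(either))  # [0, 1, 2, 3]
```

The filter is resolved on the bitmaps alone, before any metric column is touched.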
![Page 28: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/28.jpg)
ARCHITECTURE
![Page 29: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/29.jpg)
ARCHITECTURE (BATCH ONLY)
[Diagram: Data → Hadoop → Segments → Historical Nodes (×3)]
![Page 30: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/30.jpg)
HISTORICAL NODES
‣ Main workhorses of a Druid cluster
‣ Respond to queries on segments
‣ Shared-nothing architecture
![Page 31: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/31.jpg)
ARCHITECTURE (BATCH ONLY)
[Diagram: Data → Hadoop → Segments → Historical Nodes (×3); Queries → Broker Nodes (×2) → Historical Nodes]
![Page 32: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/32.jpg)
BROKER NODES
‣ Knows which nodes hold what data
‣ Query scatter/gather (send requests to nodes and merge results)
‣ Caching
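The three broker responsibilities can be illustrated with a toy in-process model. Everything here is hypothetical (class names, method names, the shape of a segment); a real broker talks to historical nodes over the network and its caching is more involved than a single dictionary.

```python
from concurrent.futures import ThreadPoolExecutor


class HistoricalNode:
    """Toy node holding segments as {interval: [(page, added), ...]}."""

    def __init__(self, segments):
        self.segments = segments

    def sum_added(self, interval, page):
        return sum(added for p, added in self.segments[interval] if p == page)


class Broker:
    def __init__(self, nodes):
        # "Knows which nodes hold what data": interval -> node.
        self.locations = {iv: node for node in nodes for iv in node.segments}
        self.cache = {}

    def sum_added(self, intervals, page):
        key = (tuple(intervals), page)
        if key not in self.cache:
            # Scatter: one request per segment. Gather: merge the partial sums.
            with ThreadPoolExecutor() as pool:
                partials = pool.map(
                    lambda iv: self.locations[iv].sum_added(iv, page), intervals
                )
                self.cache[key] = sum(partials)
        return self.cache[key]


nodes = [
    HistoricalNode({"T00/T01": [("Justin Bieber", 25)]}),
    HistoricalNode({"T01/T02": [("Justin Bieber", 32), ("Ke$ha", 60)]}),
    HistoricalNode({"T02/T03": [("Ke$ha", 12)]}),
]
broker = Broker(nodes)
print(broker.sum_added(["T00/T01", "T01/T02", "T02/T03"], "Ke$ha"))  # 72
```

Sums merge trivially, which is why this kind of aggregate parallelizes well across shared-nothing nodes.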
![Page 33: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/33.jpg)
EVOLVING A SOLUTION
[Diagram: Event Streams → Hadoop (pre-processing and storage) → Druid → Insight]
![Page 34: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/34.jpg)
MORE PROBLEMS
‣ We’ve solved the query problem
  • Druid gave us arbitrary data exploration & fast queries
‣ But what about data freshness?
  • Batch loading is slow!
  • We want “real-time”
  • Alerts, operational monitoring, etc.
![Page 35: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/35.jpg)
FAST LOADING WITH DRUID
‣ We have an indexing system
‣ We have a serving system that runs queries on data
‣ We can serve queries while building indexes!
‣ Real-time indexing workers do this
![Page 36: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/36.jpg)
REAL-TIME NODES
‣ Write-optimized data structure: hash map in heap
‣ Convert write optimized -> read optimized
‣ Read-optimized data structure: Druid segments
‣ Query data immediately
[Diagram: Events → Memory → (Convert) → Segment; Queries read from both Memory and Segment]
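The write-optimized/read-optimized split can be sketched as a dictionary that absorbs incoming events plus a list of frozen, sorted snapshots. The class and method names are invented for this example, and a sorted tuple is only a stand-in for a real Druid segment.

```python
class RealtimeIndex:
    """Toy model of a real-time node: in-memory hash map plus persisted segments."""

    def __init__(self):
        self.in_memory = {}  # write-optimized: (hour, page) -> added
        self.segments = []   # read-optimized: immutable, sorted tuples

    def ingest(self, hour, page, added):
        key = (hour, page)
        self.in_memory[key] = self.in_memory.get(key, 0) + added

    def persist(self):
        """Convert write-optimized -> read-optimized and start a fresh map."""
        if self.in_memory:
            self.segments.append(tuple(sorted(self.in_memory.items())))
            self.in_memory = {}

    def sum_added(self, page):
        """Queries see both persisted segments and events still in memory."""
        from_memory = sum(v for (_, p), v in self.in_memory.items() if p == page)
        from_segments = sum(
            v for segment in self.segments for (_, p), v in segment if p == page
        )
        return from_memory + from_segments


index = RealtimeIndex()
index.ingest("2011-01-01T01", "Ke$ha", 17)
index.ingest("2011-01-01T01", "Ke$ha", 43)
print(index.sum_added("Ke$ha"))  # 60, queryable before any segment exists
index.persist()
index.ingest("2011-01-01T02", "Ke$ha", 12)
print(index.sum_added("Ke$ha"))  # 72, merged from segment and memory
```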
![Page 37: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/37.jpg)
ARCHITECTURE (STREAMING-ONLY)
[Diagram: Streaming Data → Real-time Nodes → Segments → Historical Nodes (×3); Queries → Broker Nodes (×2) → Real-time Nodes and Historical Nodes]
![Page 38: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/38.jpg)
ARCHITECTURE (LAMBDA)
[Diagram: Batch Data → Hadoop → Segments → Historical Nodes (×3); Streaming Data → Real-time Nodes → Segments → Historical Nodes; Queries → Broker Nodes (×2) → Real-time Nodes and Historical Nodes]
![Page 39: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/39.jpg)
APPROXIMATE ANSWERS
‣ Drastically reduce storage space and compute time
  • Cardinality estimation
  • Histograms
  • Quantiles
  • Add your own proprietary modules
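To make the storage/accuracy trade-off concrete, here is a small cardinality estimator based on the k-minimum-values (KMV) technique. KMV is chosen only because it is short to write down; the slide does not say which algorithm Druid uses, and the function names and the default `k` are assumptions for this sketch.

```python
import hashlib
import heapq


def hash_to_unit_interval(value):
    """Deterministically map a value to a float in [0, 1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_cardinality(values, k=1024):
    """Estimate the distinct count while remembering at most k hash values."""
    smallest = []    # max-heap (stored negated) of the k smallest distinct hashes
    members = set()  # the same hashes, for duplicate detection
    for value in values:
        h = hash_to_unit_interval(value)
        if h in members:
            continue
        if len(smallest) < k:
            heapq.heappush(smallest, -h)
            members.add(h)
        elif h < -smallest[0]:
            evicted = -heapq.heappushpop(smallest, -h)
            members.discard(evicted)
            members.add(h)
    if len(smallest) < k:
        return len(smallest)  # fewer than k distinct values: the count is exact
    # If n hashes are spread uniformly over [0, 1), the k-th smallest sits near k / n.
    return int((k - 1) / -smallest[0])


visitors = [f"user-{i % 10000}" for i in range(30000)]  # 10,000 distinct ids
print(estimate_cardinality(visitors))
```

The estimator keeps a fixed number of hashes no matter how many distinct visitors pass through, in exchange for a relative error of roughly 1/sqrt(k).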
![Page 40: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/40.jpg)
QUERY INTERFACE
‣ Query libraries:
  • JSON over HTTP
  • SQL
  • R
  • Python
  • Ruby
  • Perl
‣ UIs
  • Pivot
  • Grafana
  • Panoramix
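As an illustration of the JSON-over-HTTP interface, the sketch below builds a timeseries query and the HTTP request that would carry it. The field names follow Druid's native JSON query format as I understand it, so check them against the documentation for your version. The broker address, the `wikipedia` datasource name, and the helper function names are assumptions. The request is constructed but not sent, since sending needs a running cluster.

```python
import json
import urllib.request

BROKER_URL = "http://localhost:8082/druid/v2/"  # hypothetical broker address


def build_timeseries_query(data_source, interval, page):
    """Hourly sums of 'added' and 'deleted' for a single page."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "hour",
        "intervals": [interval],
        "filter": {"type": "selector", "dimension": "page", "value": page},
        "aggregations": [
            {"type": "longSum", "name": "added", "fieldName": "added"},
            {"type": "longSum", "name": "deleted", "fieldName": "deleted"},
        ],
    }


def build_request(query, url=BROKER_URL):
    """Wrap the query in an HTTP POST with a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


query = build_timeseries_query("wikipedia", "2011-01-01/2011-01-02", "Ke$ha")
request = build_request(query)
print(json.dumps(query, indent=2))
# urllib.request.urlopen(request) would submit the query to a live broker.
```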
![Page 41: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/41.jpg)
DRUID TODAY
![Page 42: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/42.jpg)
THE COMMUNITY
‣ Growing Community
  • 130+ contributors from many different companies
  • In production at many different companies; we’re hoping for more!
  • Ad-tech, network traffic, operations, activity streams, etc.
  • We love contributions!
![Page 43: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/43.jpg)
PRODUCTION READY
‣ High availability through replication
‣ Rolling restarts
‣ 4 years of no downtime for software updates and restarts
‣ Battle tested
‣ Used by hundreds of companies in production
![Page 44: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/44.jpg)
DRUID IN PRODUCTION
REALTIME INGESTION
‣ >3M EVENTS / SECOND SUSTAINED (200B+ EVENTS/DAY)
‣ 10 – 100K EVENTS / SECOND / CORE
![Page 45: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/45.jpg)
DRUID IN PRODUCTION
CLUSTER SIZE
‣ >500TB OF SEGMENTS (>50 TRILLION RAW EVENTS)
‣ >5000 CORES (>400 NODES, >100TB RAM)
IT’S CHEAP
‣ MOST COST EFFECTIVE AT THIS SCALE
![Page 46: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/46.jpg)
DRUID IN PRODUCTION
QUERY LATENCY (500MS AVERAGE)
‣ 90% < 1S
‣ 95% < 2S
‣ 99% < 10S
[Figure: Query latency percentiles. Three panels (90%ile, 95%ile, 99%ile) of query time (seconds) over time, Feb 03 to Feb 24, one line per datasource a to h]
![Page 47: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/47.jpg)
DRUID IN PRODUCTION
QUERY VOLUME
‣ SEVERAL HUNDRED QUERIES / SECOND
‣ VARIETY OF GROUP BY & TOP-K QUERIES
![Page 48: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/48.jpg)
TAKE AWAYS
![Page 49: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/49.jpg)
TAKE-AWAYS
‣ When Druid?
  • You want to power user-facing data applications
  • You want to do your analysis on data as it’s happening (realtime)
  • Arbitrary data exploration with sub-second ad-hoc queries
  • OLAP, BI, Pivot (anything involving aggregates)
  • You need availability, extensibility, and flexibility
![Page 52: Druid at SF Big Analytics 2015-12-01](https://reader034.vdocuments.mx/reader034/viewer/2022042723/587069471a28ab48378b5b29/html5/thumbnails/52.jpg)
THANK YOU