DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture
Turning Explorers into Discoverers
UNIFYING REAL TIME AND HISTORICAL ANALYTICS
WITH THE LAMBDA ARCHITECTURE
WHO YOU ARE
CTOs and VPEs
Architects and Engineers
Scientists and Analysts
Product
WHO I AM
Peter Nachbaur
[email protected] / @PeterNachbaur
Currently: Product and Sales at Keen IO
Past: Analytics Platform Architect at Keen IO
Analytics Platform Engineer at WB Games
Java Developer at SmartBrief
Cognitive Science at Vassar
AGENDA
Rise of Unified Analytics
Lambda Architecture Overview
Lessons Learned at Keen
CHALLENGES IN DATA ENGINEERING
8 YEARS AGO
4 YEARS AGO
YESTERDAY
WHY NOT BOTH INDEED?
SMART DEVICES
MOBILE APPS
WEBSITES
TEAMS
CUSTOMERS
ANYWHERE
Keen IO Analytics API
insights
events
THE STACK
nginx
tornado play
kafka storm
cassandra
zookeeper memcached
redis mongo
flask react
c3
WE HAD A PROBLEM
HOW DID WE KNOW?
EXPERIENCING DIFFICULTIES
Cassandra data model
Inflexible Infrastructure Provider
Polyglot codebases
Scaling! 10x, 100x
RED QUEEN HYPOTHESIS
*WE* HAVE A PROBLEM
WHAT’S THE SOLUTION?
BUT, WAT DO?
DESIRED PROPERTIES
• robustness and fault tolerance
• low latency reads (and updates)
• generalization and extensibility
• minimal maintenance and debuggability
LAMBDA ARCHITECTURE OVERVIEW
COMPLEXITY IS THE ENEMY OF PRODUCTIVITY
5 KEY CONCEPTS
1. Parallel Ingestion
2. Batch Layer
3. Serving Layer
4. Speed Layer
5. Query Unifier
HELPING CLOTHE THE WALRUS
1. PARALLEL INGESTION
2. BATCH LAYER
write once, bulk read often
MASTER dataset
creates denormalized batch views
high latency
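A minimal Python sketch of the batch layer idea above. The event fields `visitor_id`, `shirt_id`, and `timestamp` (epoch seconds) are hypothetical and do not come from the talk. The master dataset is only ever appended to and read in bulk, and the denormalized batch view is recomputed from it in full.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def hour_bucket(timestamp):
    """Truncate an epoch timestamp (seconds) to the start of its hour."""
    return timestamp - timestamp % SECONDS_PER_HOUR


def build_batch_view(master_dataset):
    """Recompute the whole view from raw events.

    Returns a mapping of (shirt_id, hour) -> frozenset of visitor ids.
    """
    view = defaultdict(set)
    for event in master_dataset:
        key = (event["shirt_id"], hour_bucket(event["timestamp"]))
        view[key].add(event["visitor_id"])
    return {key: frozenset(visitors) for key, visitors in view.items()}


# A tuple stands in for the immutable, append-only master dataset.
master_dataset = (
    {"visitor_id": "walrus-1", "shirt_id": "tee", "timestamp": 10},
    {"visitor_id": "walrus-2", "shirt_id": "tee", "timestamp": 20},
    {"visitor_id": "walrus-1", "shirt_id": "tee", "timestamp": 3700},
)
batch_view = build_batch_view(master_dataset)
```

Because the view is derived entirely from raw events, a bug in `build_batch_view` can be fixed and the view rebuilt without losing data.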
RAWNESS
IMMUTABILITY
PERPETUITY
IN BUSINESS FOR TWO YEARS! HOW MANY UNIQUE VISITORS PER
MONTH? SHIRT?
BATCH LAYER VIEWS
1 year = ~38,000,000 ranges of hours
1 year = 8760 hour buckets
x 1000 Shirts
x 1000000 Walruseses. Walri?
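The figures above can be checked with a little arithmetic. This sketch assumes the "~38,000,000 ranges" counts every contiguous start-to-end span of whole hours within one year, which is n(n+1)/2 for n hour buckets; that reading is an assumption, not something the slide states.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hour buckets in a non-leap year


def contiguous_ranges(n):
    """Number of contiguous ranges [start, end] over n ordered buckets."""
    return n * (n + 1) // 2


hour_ranges = contiguous_ranges(HOURS_PER_YEAR)

# Hour buckets multiplied out across 1000 shirts and 1,000,000 visitors.
bucket_combinations = HOURS_PER_YEAR * 1_000 * 1_000_000

print(hour_ranges)          # 38373180
print(bucket_combinations)  # 8760000000000
```

Precomputing every range is far more expensive than storing hour buckets and combining them at query time, which is one reason to bucket by hour.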
RECOMPUTATION VS INCREMENTAL
3. SERVING LAYER
batch updates -> batch views
low latency, random reads
no random writes!
“stale” data
simplicity
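One way to picture these properties is a read-only store that is replaced wholesale whenever a batch run finishes. This is a sketch only: the class and method names are hypothetical, and an in-memory dict stands in for whatever store actually holds the views.

```python
class ServingLayer:
    """Serves random reads over the most recently loaded batch view."""

    def __init__(self):
        self._view = {}

    def load_batch_view(self, batch_view):
        # The only write path: swap in a complete, freshly computed view.
        self._view = dict(batch_view)

    def unique_visitors(self, shirt_id, hour):
        # Low-latency random read; data is as stale as the last batch run.
        return len(self._view.get((shirt_id, hour), ()))


serving = ServingLayer()
serving.load_batch_view({("tee", 0): {"walrus-1", "walrus-2"}})
```

With no random writes, there is nothing to compact, lock, or repair between batch loads.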
SHARD DATA INTELLIGENTLY
NEW SHOP DESIGN 6 HOURS AGO… HOW MANY UNIQUE VISITORS PER
MINUTE? SHIRT?
4. SPEED LAYER
low latency updates
random writes AND reads
stream processing
incremental computation of transient views
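A rough sketch of incremental computation in the speed layer, assuming one-at-a-time processing and per-minute buckets. The event fields and the function name are hypothetical; a real deployment would run this logic inside a stream processor instead of a plain loop.

```python
SECONDS_PER_MINUTE = 60


def update_realtime_view(realtime_view, event):
    """Fold a single new event into the transient view, in place.

    The view maps (shirt_id, minute) -> set of visitor ids.
    """
    minute = event["timestamp"] // SECONDS_PER_MINUTE * SECONDS_PER_MINUTE
    key = (event["shirt_id"], minute)
    realtime_view.setdefault(key, set()).add(event["visitor_id"])
    return realtime_view


realtime_view = {}
stream = [
    {"visitor_id": "walrus-1", "shirt_id": "tee", "timestamp": 5},
    {"visitor_id": "walrus-2", "shirt_id": "tee", "timestamp": 30},
    {"visitor_id": "walrus-1", "shirt_id": "tee", "timestamp": 65},
]
for event in stream:
    update_realtime_view(realtime_view, event)
```

Unlike the batch view, this view is updated with random writes and never recomputed from scratch, so it only needs to cover data the batch layer has not yet absorbed.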
SPEED LAYER OPTIONS
asynchronous or synchronous
one-at-a-time or micro-batched
VIEW EXPIRATION
CUSTOMER FACING ANALYTICS… HOW MANY UNIQUE VISITORS PER
STATE? SHIRT?
5. UNIFIED QUERIES
batch view = function(master data)
realtime view = function(realtime view, new data)
query = function(batch view, realtime view)
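The three equations above can be sketched directly as Python functions. Everything here is illustrative: the event shape (`visitor_id`, `shirt_id`) and the choice of unique visitors per shirt as the metric are assumptions made for the example.

```python
def batch_view(master_data):
    """batch view = function(master data): full recomputation."""
    view = {}
    for event in master_data:
        view.setdefault(event["shirt_id"], set()).add(event["visitor_id"])
    return view


def realtime_view(previous_view, new_data):
    """realtime view = function(realtime view, new data): incremental."""
    view = {shirt: set(visitors) for shirt, visitors in previous_view.items()}
    for event in new_data:
        view.setdefault(event["shirt_id"], set()).add(event["visitor_id"])
    return view


def query(batch, realtime, shirt_id):
    """query = function(batch view, realtime view): merge at read time."""
    merged = batch.get(shirt_id, set()) | realtime.get(shirt_id, set())
    return len(merged)


historical = [
    {"visitor_id": "walrus-1", "shirt_id": "tee"},
    {"visitor_id": "walrus-2", "shirt_id": "tee"},
]
recent = [
    {"visitor_id": "walrus-2", "shirt_id": "tee"},
    {"visitor_id": "walrus-3", "shirt_id": "tee"},
]
batch = batch_view(historical)
realtime = realtime_view({}, recent)
unique_visitors = query(batch, realtime, "tee")
```

Merging sets with a union keeps a visitor who appears in both views from being counted twice; a metric stored only as a count could not be merged this way.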
5 KEY CONCEPTS
1. Parallel Ingestion
2. Batch Layer
3. Serving Layer
4. Speed Layer
5. Query Unifier
BONUS CONCEPT!
EVENTUAL ACCURACY
CONCEPTUAL CRITIQUES
ALTERNATIVE?
APACHE BEAM
REALITIES OF MIGRATION AND
LESSONS LEARNED
HOW TO START A MIGRATION
WHAT DID WE HAVE?
1. kafka
2. storm
3. cassandra
batch-speed layer?
WHAT DID WE NEED?
10x, 100x data volumes
More flexibility
Reduced Operational Burden and TCO
ADDING BATCH LAYER
WHILE YOU’RE AT IT…
GOTCHAS
CROSS-PROVIDER NETWORKING
TOOL VERSIONING
PARALLEL INGESTION
DELETES
QUERYLIB
CULTURAL DEBT
UNIFIED ANALYTICS
LAMBDA ARCHITECTURE
PRACTICE > THEORY
JOIN THE COMMUNITY!
Analytics Slack Group -> keen.chat
Open source -> github.com/keen
Twitter -> @keen_io
IRL, right meow! Say hi to us! Ask more questions!
keen.io