DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

Upload: hakka-labs

Post on 15-Apr-2017

TRANSCRIPT

Page 1: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

Turning Explorers into Discoverers

UNIFYING REAL TIME AND HISTORICAL ANALYTICS

WITH THE LAMBDA ARCHITECTURE

Page 2: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

WHO YOU ARE

CTOs and VPEs
Architects and Engineers
Scientists and Analysts
Product

Page 3: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

WHO I AM

Peter Nachbaur

[email protected] / @PeterNachbaur

Currently
Product and Sales at Keen IO

Past
Analytics Platform Architect at Keen IO
Analytics Platform Engineer at WB Games
Java Developer at SmartBrief
Cognitive Science at Vassar

Page 4: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

AGENDA

Rise of Unified Analytics
Lambda Architecture Overview
Lessons Learned at Keen

Page 5: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

CHALLENGES IN DATA ENGINEERING

Page 6: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

8 YEARS AGO

Page 7: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

4 YEARS AGO

Page 8: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

YESTERDAY

Page 9: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

WHY NOT BOTH INDEED?

SMART DEVICES

MOBILE APPS

WEBSITES

TEAMS

CUSTOMERS

ANYWHERE

Keen IO Analytics API

insights
events

Page 10: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

THE STACK

nginx
tornado
play
kafka
storm
cassandra
zookeeper
memcached
redis
mongo
flask
react
c3

Page 11: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture
Page 12: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

WE HAD A PROBLEM

Page 13: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

HOW DID WE KNOW?

Page 14: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

EXPERIENCING DIFFICULTIES

Cassandra data model

Inflexible Infrastructure Provider

Polyglot codebases

Scaling! 10x, 100x

Page 15: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

RED QUEEN HYPOTHESIS

Page 16: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

*WE* HAVE A PROBLEM

Page 17: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

WHAT’S THE SOLUTION?

Page 18: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

BUT, WAT DO?

Page 19: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

DESIRED PROPERTIES

• robustness and fault tolerance
• low latency reads (and updates)
• generalization and extensibility
• minimal maintenance and debuggability

Page 20: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

LAMBDA ARCHITECTURE OVERVIEW

Page 21: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

COMPLEXITY IS THE ENEMY OF PRODUCTIVITY

Page 22: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

5 KEY CONCEPTS

1. Parallel Ingestion
2. Batch Layer
3. Serving Layer
4. Speed Layer
5. Query Unifier

Page 23: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

HELPING CLOTHE THE WALRUS

Page 24: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture
Page 25: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

1. PARALLEL INGESTION

Page 26: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

2. BATCH LAYER

write once, bulk read often
MASTER dataset
creates denormalized batch views
high latency
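
A minimal sketch of the batch layer points above, assuming the master dataset is an append-only collection of page-view events for the shirt shop. The field names (`visitor`, `shirt`, `hour`) and the function `build_batch_view` are hypothetical illustrations, not Keen IO's schema or code.

```python
from collections import defaultdict

# Hypothetical master dataset: raw, immutable, append-only events.
MASTER = (
    {"visitor": "w1", "shirt": "plaid", "hour": 0},
    {"visitor": "w2", "shirt": "plaid", "hour": 0},
    {"visitor": "w1", "shirt": "plaid", "hour": 1},
    {"visitor": "w3", "shirt": "tweed", "hour": 1},
)


def build_batch_view(master):
    """Recompute the whole view from scratch: (shirt, hour) -> set of visitors."""
    view = defaultdict(set)
    for event in master:  # bulk read of the full dataset
        view[(event["shirt"], event["hour"])].add(event["visitor"])
    # Freeze the result: a batch view is replaced wholesale, never updated in place.
    return {key: frozenset(visitors) for key, visitors in view.items()}


batch_view = build_batch_view(MASTER)
print(len(batch_view[("plaid", 0)]))  # unique visitors for plaid shirts in hour 0
```

Because the view is a pure function of the master dataset, a bug in the view logic can be fixed by rerunning the function; the price is the high latency of a full pass.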

Page 27: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

RAWNESS

Page 28: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

IMMUTABILITY

Page 29: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

PERPETUITY

Page 30: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

IN BUSINESS FOR TWO YEARS!

HOW MANY UNIQUE VISITORS PER MONTH? SHIRT?

Page 31: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

BATCH LAYER VIEWS

1 year = ~38,000,000 ranges of hours
1 year = 8760 hour buckets
x1000 Shirts
x1000000 Walruseses. Walri?
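
The figures above can be checked with a little arithmetic. This sketch assumes a "range of hours" means any contiguous span with a start hour and an end hour (start <= end) inside one non-leap year; that reading is an interpretation of the slide, and the per-shirt multiplication is one possible way to combine its multipliers.

```python
HOURS_PER_YEAR = 365 * 24  # hour buckets in a non-leap year

# Each range is a (start, end) pair of hour buckets with start <= end,
# so there are n * (n + 1) / 2 of them.
ranges_per_year = HOURS_PER_YEAR * (HOURS_PER_YEAR + 1) // 2

SHIRTS = 1_000
# Assumed combination: one hourly bucket per shirt.
hour_buckets_per_year_all_shirts = HOURS_PER_YEAR * SHIRTS

print(HOURS_PER_YEAR)                    # 8760
print(ranges_per_year)                   # 38373180
print(hour_buckets_per_year_all_shirts)  # 8760000
```

Precomputing every possible range is roughly 4,000 times more expensive than storing hourly buckets and summing them at query time, which is why batch views are usually bucketed.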

Page 32: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

RECOMPUTATION VS INCREMENTAL

Page 33: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

3. SERVING LAYER

batch updates -> batch views
low latency, random reads
no random writes!
“stale” data
simplicity
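
One way to picture "random reads, no random writes" is a read-only mapping that is only ever swapped out as a whole. The class `ServingLayer` and its methods are hypothetical names for this sketch; a real serving layer would be a distributed, indexed store rather than an in-process dict.

```python
from types import MappingProxyType


class ServingLayer:
    """Holds the latest batch view behind a read-only interface."""

    def __init__(self):
        self._view = MappingProxyType({})

    def load(self, batch_view):
        # Batch update: swap in a complete new view, no per-key writes.
        self._view = MappingProxyType(dict(batch_view))

    def get(self, key, default=0):
        # Low latency random read by key.
        return self._view.get(key, default)


serving = ServingLayer()
serving.load({("plaid", 0): 2, ("tweed", 1): 1})
print(serving.get(("plaid", 0)))
```

Data served this way is only as fresh as the last batch run, which is the "stale" data the slide mentions.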

Page 34: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

SHARD DATA INTELLIGENTLY

Page 35: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

NEW SHOP DESIGN 6 HOURS AGO…

HOW MANY UNIQUE VISITORS PER MINUTE? SHIRT?

Page 36: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

4. SPEED LAYER

low latency updates
random writes AND reads
stream processing
incremental computation of transient views
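
A minimal sketch of incremental computation in the speed layer, assuming events arrive one at a time and the transient view maps (shirt, minute) to the set of visitors seen so far. The event fields and the function `update_realtime_view` are hypothetical.

```python
from collections import defaultdict


def update_realtime_view(realtime_view, event):
    """Fold one new event into the transient view (one-at-a-time processing)."""
    key = (event["shirt"], event["minute"])
    realtime_view[key].add(event["visitor"])  # random write
    return realtime_view


realtime_view = defaultdict(set)
stream = [
    {"visitor": "w1", "shirt": "plaid", "minute": 0},
    {"visitor": "w1", "shirt": "plaid", "minute": 0},  # repeat visit, same minute
    {"visitor": "w2", "shirt": "plaid", "minute": 0},
]
for event in stream:
    update_realtime_view(realtime_view, event)

print(len(realtime_view[("plaid", 0)]))  # random read of the current count
```

Unlike the batch layer, nothing here is recomputed from scratch: each event touches only the one key it affects.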

Page 37: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

SPEED LAYER OPTIONS

asynchronous or synchronous
one-at-a-time or micro-batched

Page 38: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

VIEW EXPIRATION

Page 39: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

CUSTOMER FACING ANALYTICS…

HOW MANY UNIQUE VISITORS PER STATE? SHIRT?

Page 40: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

5. UNIFIED QUERIES

batch view = function(master data)

realtime view = function(realtime view, new data)

query = function(batch view, realtime view)
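
The three equations above translate almost directly into code. This sketch assumes the metric is unique visitors and that both views are sets of visitor IDs, so merging is a set union; the function names and event shape are hypothetical.

```python
def batch_view_fn(master_data):
    """batch view = function(master data)"""
    return frozenset(event["visitor"] for event in master_data)


def realtime_view_fn(realtime_view, new_data):
    """realtime view = function(realtime view, new data)"""
    return realtime_view | {event["visitor"] for event in new_data}


def query(batch_view, realtime_view):
    """query = function(batch view, realtime view)"""
    # Union, not addition: w2 appears in both views but is one visitor.
    return len(batch_view | realtime_view)


master = [{"visitor": "w1"}, {"visitor": "w2"}]  # already processed by batch
recent = [{"visitor": "w2"}, {"visitor": "w3"}]  # arrived since the last batch run

bv = batch_view_fn(master)
rv = realtime_view_fn(frozenset(), recent)
print(query(bv, rv))
```

Keeping sets rather than counts makes the merge exact for this small example; at scale an approximate structure would typically stand in for the raw sets.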

Page 41: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

5 KEY CONCEPTS

1. Parallel Ingestion
2. Batch Layer
3. Serving Layer
4. Speed Layer
5. Query Unifier

Page 42: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

BONUS CONCEPT!

EVENTUAL ACCURACY

Page 43: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

CONCEPTUAL CRITIQUES

Page 44: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

ALTERNATIVE?

Page 45: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

APACHE BEAM

Page 46: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

REALITIES OF MIGRATION AND

LESSONS LEARNED

Page 47: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

HOW TO START A MIGRATION

Page 48: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

WHAT DID WE HAVE?

1. kafka
2. storm
3. cassandra
batch-speed layer?

Page 49: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

WHAT DID WE NEED?

10x, 100x data volumes
More flexibility
Reduced Operational Burden and TCO

Page 50: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

ADDING BATCH LAYER

Page 51: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

WHILE YOU’RE AT IT…

Page 52: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

GOTCHAS

Page 53: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

CROSS-PROVIDER NETWORKING

Page 54: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

TOOL VERSIONING

Page 55: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

PARALLEL INGESTION

Page 56: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

DELETES

Page 57: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

QUERYLIB

Page 58: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

CULTURAL DEBT

Page 59: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

UNIFIED ANALYTICS

LAMBDA ARCHITECTURE

PRACTICE > THEORY

Page 60: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

CATCH { Q: QUESTIONS => …}

@PeterNachbaur [email protected]

Page 61: DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambda Architecture

JOIN THE COMMUNITY!

Analytics Slack Group -> keen.chat

Open source -> github.com/keen

Twitter -> @keen_io

IRL, right meow! Say hi to us! Ask more questions!

keen.io