Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Post on 18-Aug-2015

Eric Frenkiel, MemSQL CEO and co-founder

August 11, 2015 • San Diego, CA

Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

What’s In Store

MemSQL and a fresh look at Lambda architectures

Building real-time data pipelines for immediate impact

One architecture for many applications

MemSQL at a Glance

• Enable every company to be a real-time enterprise

• Founded 2011, based in San Francisco

• Founders are ex-Facebook, SQL Server engineers

• Deliver a database technology for modern architecture

Enterprise Focus

The Real-Time Database for Transactions and Analytics 

In-Memory, Distributed, Relational

Data Center, Software, Cloud

A Fresh Look at Lambda Architectures

Speed: Fast Updates

Batch: Fast Appends

Serving: Unified queries, full SQL

Comprehensive Architecture

Real Time: Speed/Streaming Layer, Fast Updates, Rowstore

Historical: Batch Layer, Fast Appends, Columnstore

Transactions

Analytics

Execution engine that spans the data spectrum

Building Real-Time Data Pipelines for Immediate Impact

By 2020, HP predicts that over a trillion sensors will be online

“The Internet of Things Will Drastically Change Our Future” – Datafloq

Going Real-Time is the Next Phase for Big Data

More Devices

More Interconnectivity

More User Demand

…and companies are at risk of being left behind

Expensive, not scalable, batch only, SAN-burdened

1%

Success will be driven by real-time analytic applications.

Designing the Ideal Real-Time Pipeline

Message Queue → Transformation → Speed/Serving Layer

End-to-End Data Pipeline Under One Second

Kafka

A high-throughput distributed messaging system

Publish and subscribe to Kafka “topics”

Centralized data transport for the organization

Spark

In-memory execution engine

High level operators for procedural and programmatic analytics

Faster than MapReduce

MemSQL

In-memory, distributed database

Full transactions and complete durability

Enable real-time, performant applications

Use Spark and Operational Databases Together

                      | Spark              | Operational Databases
Interface             | Programmatic       | Declarative
Execution Environment | Job Scheduler      | SQL Engine and Query Optimizer
Persistent Storage    | Use another system | Built-in

Subscribing to Kafka

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

0111001010101111101111100000001010111100001110101100000010010010111…

Publish to Kafka Topic

0111001010101111101111100000001010111100001110101100000010010010111…

1110010101000101010001010100010111111010100011110101100011010101000…

0101111000011100101010111110001111011010111100000000101110101100000…

Event added to message queue
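
The publish step on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenter's code: the field names, the JSON wire format, and the in-memory `topic` queue are all assumptions standing in for a real Kafka producer and topic.

```python
import json
from collections import deque

# Field order follows the example event on the slide; the names are assumptions.
FIELDS = ("time", "house_id", "device_id", "watts")


def serialize_event(event):
    """Encode an event tuple as UTF-8 JSON bytes, the form a message queue carries."""
    return json.dumps(dict(zip(FIELDS, event))).encode("utf-8")


# Stand-in for a Kafka topic: an in-memory FIFO queue, not a real Kafka client.
topic = deque()

event = ("2015-07-06T16:43:40.33Z", 329280, 23, 60)
topic.append(serialize_event(event))  # "Event added to message queue"
```

In a real deployment the `append` call would be replaced by a Kafka producer sending the same bytes to a named topic.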

Enrich and Transform the Data

Spark polling Kafka for new messages

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)

Deserialization

Enrichment

0111001010101111101111100000001010111100001110101100000010010010111…
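
The deserialization and enrichment steps might look something like the sketch below. It is plain Python rather than a Spark job, and it assumes a JSON wire format plus two hypothetical lookup tables (`HOUSE_ZIP`, `DEVICE_TYPE`); only the single mapping shown on the slide is filled in.

```python
import json

# Hypothetical reference data; only these two mappings appear in the slide's example.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(message):
    """Decode UTF-8 JSON bytes back into a (time, house_id, device_id, watts) tuple."""
    record = json.loads(message.decode("utf-8"))
    return (record["time"], record["house_id"], record["device_id"], record["watts"])


def enrich(event):
    """Add zip and device_type by looking up house_id and device_id."""
    time, house_id, device_id, watts = event
    return (
        time,
        house_id,
        HOUSE_ZIP.get(house_id),      # None if the house is unknown
        device_id,
        DEVICE_TYPE.get(device_id),   # None if the device is unknown
        watts,
    )


# A raw message as it might arrive from the queue (assumed JSON encoding).
raw = json.dumps(
    {"time": "2015-07-06T16:43:40.33Z", "house_id": 329280, "device_id": 23, "watts": 60}
).encode("utf-8")

enriched = enrich(deserialize(raw))
```

In Spark, the same two functions could be applied to each message with a `map` over the incoming stream.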

Persist and Prepare for Production

RDD.saveToMemSQL()

INSERT INTO memcity_table ...

time | house_id | zip | device_id | device_type | watts
2015-07-06T16:43:40.33Z | 329280 | 94110 | 23 | ‘kitchen_appliance’ | 60
… | … | … | … | … | …
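
As a rough illustration of the persist step, the sketch below inserts the enriched row into a table shaped like `memcity_table`. It uses Python's built-in `sqlite3` purely as a local stand-in so the example runs anywhere; it is not MemSQL and does not use the connector's `saveToMemSQL()`. The column types are assumptions.

```python
import sqlite3

# sqlite3 is only a local stand-in for the operational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE memcity_table (
        time        TEXT,     -- ISO 8601 timestamp (assumed type)
        house_id    INTEGER,
        zip         INTEGER,
        device_id   INTEGER,
        device_type TEXT,
        watts       INTEGER
    )
    """
)

# The enriched event from the slide.
rows = [("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60)]

# Parameterized INSERT, one statement per row in the batch.
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT * FROM memcity_table").fetchall()
```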

Go to Production

Compress development timelines

SELECT ... FROM memcity_table ...
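
The slide elides the actual query, so the one below is purely illustrative: a sketch of the kind of ad-hoc SQL an application could run once events are persisted. It again uses `sqlite3` as a stand-in for the database, and every row except the first is made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memcity_table ("
    "time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)
conn.executemany(
    "INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Row taken from the slide's example.
        ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
        # The rows below are invented sample data for illustration only.
        ("2015-07-06T16:43:41.10Z", 329280, 94110, 7, "lighting", 15),
        ("2015-07-06T16:43:42.05Z", 329281, 94110, 23, "kitchen_appliance", 40),
    ],
)

# Hypothetical analytic query: total power draw per device type, highest first.
query = """
    SELECT device_type, SUM(watts) AS total_watts
    FROM memcity_table
    GROUP BY device_type
    ORDER BY total_watts DESC
"""
totals = conn.execute(query).fetchall()
```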

One Architecture for Many Applications

Lambda Applies to Real-Time Data Pipelines

Inputs → Message Queue → Transformation → Database → Application

Batch

Kafka, Spark, and MemSQL Make it Simple

Batch

Inputs Application

Monitoring real-time Xfinity programming and video health

Collect streaming data at scale (hundreds of MemSQL machines)

Proactively diagnose issues

Query ad-hoc and in real-time with full SQL

From 30 minutes to less than 1 second

Real-time Analytics

Real-Time Trend Analytics

Massive Ingest and Concurrent Analytics

Instant accuracy to the latest repin

Build real-time analytic applications

Real-time analytics

Watch the Pinterest Demo Video here: https://youtu.be/KXelkQFVz4E

Real-Time Segmentation

Using Real-Time for Personalization

Ad Servers EC2

Real-time analytics

PostgreSQL

Legacy reports

Monitoring

S3 (replay)

HDFS

Data Science

Vertica

Operational Data Store (ODS)

Star Schema

MicroStrategy

Reach overlap and ad optimization

Over 60,000 queries per second

Millisecond response times

Thank You!

Visit MemSQL at Booth #518

Real-Time Demos, T-Shirt Giveaway, Games
