State Management in Apache Flink: Consistent Stateful Distributed Stream Processing

Paris Carbone - KTH Royal Institute of Technology
Stephan Ewen - data Artisans
Gyula Fóra - King Digital Entertainment Ltd
Seif Haridi - KTH Royal Institute of Technology
Stefan Richter - data Artisans
Kostas Tzoumas - data Artisans

@vldb17


Page 1: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing


State Management in Apache Flink®

Consistent Stateful Distributed Stream Processing

@vldb17

Page 2: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Overview

• The Apache Flink System Architecture

• Pipelined Consistent Snapshots

• Operations with Snapshots

• Large Scale Deployments and Evaluation


Page 3: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

The Apache Flink Framework

[Figure: the Flink stack. Libraries (SQL, Table, CEP, Graphs, ML) sit on the Core API (DataStream, DataSet), which sits on the Dataflow Runtime. Further labels: Cluster, Backend, Metrics; Runner; Setup.]

Page 4: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Distributed Architecture

[Figure, built up over several slides: the Flink stack (Libraries, Core API, Dataflow Runtime; Cluster, Backend, Metrics) deployed across a Client, a Job Manager, several Task Managers, Zookeeper and an external snapshot store.]

• Client: sends the optimised logical graph to the Job Manager

• Job Manager: scheduling, state partitioning, snapshot coordination

• Zookeeper: passive failover, snapshot metadata

• Task Managers: memory management, local snapshot execution, flow control

• Task Managers run physical long-running tasks with locally managed state

• External Snapshot Store (e.g., HDFS): receives partial snapshots

Page 12: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Snapshots

1. End-to-End Guarantees

2. Reconfiguration

3. Version Control

4. Isolation

Page 14: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Stateful Processing

[Figure, built up over several slides: a chain of tasks on top of managed state and a local state backend.]

• Tasks: invoke per input record

• Tasks read and write managed state through logical operations (collections)

• Local State Backend: carries out the physical operations

• Backend options: In-Memory (Heap), or Embedded Off-heap + Disk Key-Value Store (RocksDB)

• state = f(input)
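A minimal sketch of the idea on this slide, not Flink's actual API: a task is invoked once per input record and reads and writes managed state through a logical interface, while a backend decides how the state is physically stored. The names `HeapStateBackend` and `CountingTask` are hypothetical and only stand in for an in-memory (heap) backend and a user task.

```python
from collections import defaultdict


class HeapStateBackend:
    """Hypothetical in-memory backend: one value per (state name, key)."""

    def __init__(self):
        self._tables = defaultdict(dict)

    def get(self, name, key, default=None):
        return self._tables[name].get(key, default)

    def put(self, name, key, value):
        self._tables[name][key] = value


class CountingTask:
    """Invoked once per input record; keeps a per-key count in managed state."""

    def __init__(self, backend):
        self.backend = backend

    def process(self, key):
        # Logical read, update, and write; the backend owns the physical layout.
        count = self.backend.get("count", key, 0) + 1
        self.backend.put("count", key, count)
        return key, count


backend = HeapStateBackend()
task = CountingTask(backend)
outputs = [task.process(k) for k in ["bob", "alice", "bob"]]
```

Because the task only sees `get` and `put`, the same task code could run against a different backend (for example a disk-backed key-value store) without changes, which is the separation the slide draws between logical and physical operations.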

Page 19: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

[Figure, built up over several slides: input streams flow into a stream processor that holds local states.]

• divide computation into epochs

• capture all local states after completing an epoch

• can roll back input and state to a captured point in the past
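The three steps above can be sketched on a single process. This is an illustration under simplifying assumptions, not Flink code: the input is a replayable list standing in for a persisted log, an epoch is a fixed number of records, and `EpochProcessor` is a hypothetical name.

```python
import copy


class EpochProcessor:
    def __init__(self, source):
        self.source = source    # replayable input (a list standing in for a log)
        self.offset = 0         # position of the next record to read
        self.state = {}         # local state: running count per record value
        self.snapshots = []     # captured (offset, state) pairs, one per epoch

    def run_epoch(self, size):
        # Process one epoch of input, then capture offset and state together.
        for record in self.source[self.offset:self.offset + size]:
            self.state[record] = self.state.get(record, 0) + 1
        self.offset = min(self.offset + size, len(self.source))
        self.snapshots.append((self.offset, copy.deepcopy(self.state)))

    def rollback(self, epoch):
        # Restore input position and state to the end of the given epoch.
        self.offset, saved = self.snapshots[epoch]
        self.state = copy.deepcopy(saved)
        del self.snapshots[epoch + 1:]


p = EpochProcessor(["a", "b", "a", "c", "a", "b"])
p.run_epoch(2)                                  # epoch 0
p.run_epoch(2)                                  # epoch 1
state_after_epoch_1 = copy.deepcopy(p.state)
p.run_epoch(2)                                  # epoch 2
final_state = copy.deepcopy(p.state)
p.rollback(1)                                   # pretend epoch 2 failed
rolled_back_state = copy.deepcopy(p.state)
p.run_epoch(2)                                  # repeat epoch 2
```

Rolling back the offset together with the state is what makes the repeated epoch produce the same result as the first attempt.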

Page 25: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

A Synchronous Approach

[Animation over several slides: a master coordinates the tasks. The pipeline drains epoch 1 and copies states to the Snapshot Store, then drains epoch 2 and copies states again.]

Page 32: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Synchronous Snapshots

• In use: Storm Trident and Spark Streaming

• A conservative approach, equivalent to batching

• Can cause unnecessary latency (master coordination)

• Processing is no longer continuous

• Forces many tasks to be idle

• Instead, in Apache Flink snapshots are pipelined

Page 33: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Pipelined Snapshots

[Animation over several slides: a dataflow of tasks A, B, C, D and E next to a Snapshot Store.]

• insert markers

• epoch alignment

• async state copy: task states appear in the Snapshot Store one by one (B, A, C, D, E in the animation)

• snapshot completes
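A sketch of epoch alignment for one task with several input channels, under stated assumptions: `MARKER` is a hypothetical marker value, the task state is a running sum, and the state copy is done inline here rather than asynchronously. Once a marker arrives on a channel, later records from that channel belong to the next epoch and are held back until markers have arrived on all channels.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical epoch marker inserted into the streams


class AligningTask:
    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.total = 0            # local state: running sum of records
        self.blocked = set()      # channels whose marker has already arrived
        self.buffered = deque()   # items held back during alignment
        self.snapshots = []       # state captured at each epoch boundary
        self.forwarded = []       # markers sent downstream

    def receive(self, channel, item):
        if channel in self.blocked:
            # Belongs to a later epoch: hold it until alignment completes.
            self.buffered.append((channel, item))
        elif item == MARKER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                self._complete_alignment()
        else:
            self.total += item

    def _complete_alignment(self):
        # All inputs reached the epoch boundary: capture state, forward marker.
        self.snapshots.append(self.total)
        self.forwarded.append(MARKER)
        self.blocked.clear()
        pending, self.buffered = self.buffered, deque()
        for channel, item in pending:
            self.receive(channel, item)   # release held-back items in order


task = AligningTask(num_inputs=2)
events = [(0, 1), (1, 10), (0, MARKER), (0, 2), (1, 20), (1, MARKER), (1, 30)]
for channel, item in events:
    task.receive(channel, item)
```

In this run the record `2` on channel 0 arrives after that channel's marker, so it is excluded from the snapshot and only applied once channel 1's marker arrives.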

Page 43: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Pipelined Snapshots (cycles)

Problem: we cannot wait indefinitely for records in cycles

Solution: log in-flight records within a cycle in the snapshot. Replay upon recovery.
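One way to picture the solution, as a sketch rather than Flink's implementation: a task at the head of a loop snapshots its state as soon as the marker arrives on its regular input, then logs every record that comes back over the feedback edge until the marker itself returns through the cycle. `LoopHead` and `MARKER` are hypothetical names, and the task has exactly one regular and one feedback input for simplicity.

```python
MARKER = "MARKER"  # hypothetical epoch marker


class LoopHead:
    """Task with one regular input and one feedback (cycle) input."""

    def __init__(self):
        self.total = 0          # local state: running sum
        self.logging = False    # True while in-flight cycle records are logged
        self.snapshot = None    # captured state plus logged in-flight records

    def on_regular(self, item):
        if item == MARKER:
            # Do not wait for the cycle: capture state now and start logging.
            self.snapshot = {"state": self.total, "inflight": []}
            self.logging = True
        else:
            self.total += item

    def on_feedback(self, item):
        if item == MARKER:
            # The marker travelled around the cycle: the log is complete.
            self.logging = False
            return
        if self.logging:
            self.snapshot["inflight"].append(item)
        self.total += item

    def recover(self):
        # Restore the captured state, then replay the logged records.
        self.total = self.snapshot["state"]
        for item in self.snapshot["inflight"]:
            self.total += item
        self.logging = False


head = LoopHead()
head.on_regular(5)
head.on_regular(MARKER)     # snapshot taken with state 5
head.on_feedback(7)         # in flight inside the cycle: logged
head.on_feedback(MARKER)    # logging stops
head.on_feedback(100)       # belongs to the next epoch
head.on_regular(1)
head.recover()
```

After recovery the state reflects the snapshot plus the replayed in-flight record, and nothing from the following epoch.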

Page 46: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Technique Highlights

• Offers exactly-once processing guarantees

• Issued periodically/externally by the user

• Naturally respects flow control mechanisms

• Channel state logging limited to cycles only

• Multiple epoch snapshots can be pipelined

• Can offer weaker at-least-once processing guarantees by simply dropping alignment (aligning vs. no alignment cost)

Page 47: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Snapshots Usages

1. End-to-End Guarantees

2. Reconfiguration

3. Version Control

4. Isolation

Page 48: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Exactly-Once: Input and Processing

Important Assumptions

• Input streams are persisted with offset indexes (e.g., Kafka, Kinesis)

• Data Channels are FIFO and reliable (no loss)

Each epoch either completes or repeats


Page 49: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Exactly-Once Output

• Idempotency ~ repeated operations can be tolerated after recovery/rollback (works for mutable stores).

• Transactional Processing ~ requires a two-phase coordination. A snapshot completion eventually leads to an external commit (e.g., Flink’s HDFS RollingSink*)

[Figure: a sequence of epochs n, n-1, n-2 and n-3 with the status labels in-progress, pending, pending and committed.]
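The transactional variant can be sketched as a sink that moves output through three stages: in-progress, pending, committed. This is an illustration only; `TransactionalSink` and its method names are hypothetical and do not describe the RollingSink API. Output becomes pending when the epoch marker reaches the sink (phase one) and is committed only when the snapshot for that epoch is reported complete (phase two).

```python
class TransactionalSink:
    def __init__(self):
        self.in_progress = []   # records of the current epoch
        self.pending = {}       # epoch -> records awaiting snapshot completion
        self.committed = []     # records visible to external readers

    def write(self, record):
        self.in_progress.append(record)

    def on_marker(self, epoch):
        # Phase 1: close the epoch and keep its output as pending.
        self.pending[epoch] = self.in_progress
        self.in_progress = []

    def on_snapshot_complete(self, epoch):
        # Phase 2: commit every pending epoch up to and including this one.
        for e in sorted(e for e in self.pending if e <= epoch):
            self.committed.extend(self.pending.pop(e))

    def on_rollback(self, epoch):
        # Discard output produced after the restored epoch; it will be redone.
        self.in_progress = []
        for e in [e for e in self.pending if e > epoch]:
            del self.pending[e]


sink = TransactionalSink()
sink.write("a")
sink.on_marker(1)
sink.write("b")
sink.on_marker(2)
sink.write("c")
sink.on_snapshot_complete(1)   # "a" becomes visible
sink.on_rollback(1)            # failure: epoch 2 and later are discarded
sink.write("b")                # epoch 2 is repeated after recovery
sink.on_marker(2)
sink.on_snapshot_complete(2)
```

Even though "b" was written twice, external readers see it once, because the first attempt never left the pending stage.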


Page 51: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Dataflow Reconfiguration

[Animation over several slides: a dataflow takes snap-1 and snap-2, is stopped, changes parallelism, and then takes snap-3.]

Problem: How is state repartitioned from a snapshot?

Page 56: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Reconfiguration: The Issue

Stored state before reconfiguring: 0x100: bob … … … … 0x449: alice

• Case I, full scan: scan remote storage for responsible keys. Too slow.

• Case II: include key locations in snapshot metadata (bob: 0x100, carol: 0x344, … ; alice: 0x449, chuck: 0x630, …). Too much metadata.

Page 61: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Reconfiguration: Key Groups

Pre-partition state in hash(K) space, into key-groups

[Figure: keys such as bob and alice placed into key-groups.]

• Snapshot Metadata: Contains a reference per stored Key-Group (less metadata)

• Reconfiguration: Contiguous key-group allocation to available tasks (less IO)

Note: number of key groups controls trade-off between metadata to keep and reconfiguration speed
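A small sketch of the key-group idea, with assumptions spelled out: the hash function (SHA-1 here), the number of key-groups (128) and the range formula are illustrative choices, not the ones Flink necessarily uses. A key is mapped to a key-group once, independently of parallelism; each task then owns a contiguous range of key-groups, so changing parallelism only changes the ranges.

```python
import hashlib


def key_group(key, num_key_groups):
    """Map a key to a key-group; the result does not depend on parallelism."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_key_groups


def key_group_range(task_index, parallelism, num_key_groups):
    """Contiguous range of key-groups owned by one task."""
    start = task_index * num_key_groups // parallelism
    end = (task_index + 1) * num_key_groups // parallelism
    return range(start, end)


def owner(key, parallelism, num_key_groups):
    """Index of the task responsible for a key at the given parallelism."""
    group = key_group(key, num_key_groups)
    for task in range(parallelism):
        if group in key_group_range(task, parallelism, num_key_groups):
            return task


NUM_KEY_GROUPS = 128  # assumed value; fixed for the lifetime of the job
before = [key_group_range(t, 2, NUM_KEY_GROUPS) for t in range(2)]
after = [key_group_range(t, 4, NUM_KEY_GROUPS) for t in range(4)]
```

On rescaling from 2 to 4 tasks, each new task reads one contiguous slice of key-groups from the snapshot instead of scanning for individual keys, and the metadata needs one reference per key-group rather than one per key.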


Page 65: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Version Control

[Animation over several slides: Pipeline v.1 is forked and updated, giving Pipeline v.2 and then Pipeline v.3.]


Page 72: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Isolation Levels

External query: select from facebook.userID, clients.name … inner join clients on …

• read-committed (snapshot)

• read-uncommitted (dirty read on latest state)
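The two levels can be illustrated with a toy store that keeps both the latest state and the state as of the last completed snapshot. `QueryableState` and the `isolation` parameter are hypothetical names for this sketch and are not a Flink API.

```python
import copy


class QueryableState:
    def __init__(self):
        self.live = {}        # latest state, may reflect an unfinished epoch
        self.committed = {}   # state as of the last completed snapshot

    def update(self, key, value):
        self.live[key] = value

    def complete_snapshot(self):
        # A completed snapshot becomes the new read-committed view.
        self.committed = copy.deepcopy(self.live)

    def query(self, key, isolation="read-committed"):
        source = self.committed if isolation == "read-committed" else self.live
        return source.get(key)


store = QueryableState()
store.update("bob", 1)
store.complete_snapshot()
store.update("bob", 2)        # not yet covered by a completed snapshot
committed_view = store.query("bob")
dirty_view = store.query("bob", isolation="read-uncommitted")
```

A read-committed query never observes state that a rollback could still undo, at the cost of lagging behind by up to one snapshot interval.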

Page 74: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Large Scale Deployment at King

[Figure: Total Snapshotting Time (sec) against Global State Size (GB), showing total time per snapshot (alignment + async copies). Annotation: ~runtime overhead.]

[Figure: Total Alignment Time (msec) against Parallelism (30, 50, 70), with series labelled PROC, WIN and OUT. Annotation: alignment cost.]

• #shuffles (keyby)

• parallelism

Page 80: State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

Teaser: More paper highlights

• We can use the same technique to coordinate externally managed state with snapshots.

• Epoch markers can act as on-the-fly reconfiguration points.

• Internals of asynchronous and incremental snapshots.

