State Management in Apache Flink: Consistent Stateful Distributed Stream Processing

Post on 17-Mar-2018


State Management in Apache Flink®

Consistent Stateful Distributed Stream Processing

@vldb17

Paris Carbone <parisc@kth.se> - KTH Royal Institute of Technology

Stephan Ewen <stephan@data-artisans.com> - data Artisans

Gyula Fóra <gyula.fora@king.com> - King Digital Entertainment Ltd

Seif Haridi <haridi@kth.se> - KTH Royal Institute of Technology

Stefan Richter <s.richter@data-artisans.com> - data Artisans

Kostas Tzoumas <kostas@data-artisans.com> - data Artisans

Overview

• The Apache Flink System Architecture

• Pipelined Consistent Snapshots

• Operations with Snapshots

• Large Scale Deployments and Evaluation


The Apache Flink Framework

[Figure: the Flink software stack by layer. Setup: Cluster, Backend, Metrics. Runner: Dataflow Runtime. Core API: DataStream, DataSet. Libraries: SQL, Table, CEP, Graphs, ML.]

Distributed Architecture

• Client: submits the optimised logical graph to the Job Manager

• Job Manager: scheduling, state partitioning, snapshot coordination

• Zookeeper: passive failover, snapshot metadata

• Task Managers: memory management, local snapshot execution, flow control; they run the physical long-running tasks, which hold locally managed state

• External Snapshot Store (e.g., HDFS): stores the partial snapshots

Snapshots

1. End-to-End Guarantees

2. Reconfiguration

3. Version Control

4. Isolation

Stateful Processing

• Tasks are invoked per input record

• state = f(input)

• Tasks read and write managed state through logical operations (collections)

• A Local State Backend performs the physical operations: In-Memory (Heap), or Embedded Off-heap + Disk Key-Value Store (RocksDB)
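As a rough illustration of the split between logical state operations and a physical backend, the following minimal Python sketch invokes a task once per input record and routes its reads and writes through a pluggable backend. The names `StateBackend`, `HeapBackend` and `CountingTask` are hypothetical and are not Flink's API; an embedded key-value store such as RocksDB would be another implementation of the same interface.

```python
from abc import ABC, abstractmethod


class StateBackend(ABC):
    """Hypothetical interface for the physical state operations."""

    @abstractmethod
    def get(self, key, default=None):
        ...

    @abstractmethod
    def put(self, key, value):
        ...


class HeapBackend(StateBackend):
    """In-memory (heap) backend: a plain dictionary."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class CountingTask:
    """A task that is invoked once per input record and updates managed state."""

    def __init__(self, backend):
        self.state = backend

    def process(self, record):
        # Logical operation: read, update and write the per-key count.
        count = self.state.get(record, 0) + 1
        self.state.put(record, count)
        return record, count


task = CountingTask(HeapBackend())
outputs = [task.process(r) for r in ["bob", "alice", "bob"]]
print(outputs)
```

The task code never touches the storage layout, so swapping the backend would not change `process`.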

[Figure: input streams feed a stream processor that keeps local states]

• Divide computation into epochs

• Capture all local states after completing an epoch

• Input and state can be rolled back to a captured point in the past
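The epoch idea can be sketched for a single task as follows. This is a simplified illustration under stated assumptions, not Flink's mechanism: `EpochProcessor` is a hypothetical class, epochs have a fixed record count, and a Python list stands in for a replayable input stream addressed by offset.

```python
import copy


class EpochProcessor:
    """Hypothetical single-task processor that snapshots state at epoch boundaries."""

    def __init__(self, epoch_size):
        self.epoch_size = epoch_size
        self.state = {}       # local state: running count per key
        self.offset = 0       # position in the input stream
        self.snapshots = []   # captured (offset, state) pairs

    def run(self, stream, until):
        while self.offset < until:
            key = stream[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.epoch_size == 0:
                # Epoch complete: capture the input position and the local state.
                self.snapshots.append((self.offset, copy.deepcopy(self.state)))

    def rollback(self):
        # Restore the latest captured point, or the initial state if there is none.
        offset, state = self.snapshots[-1] if self.snapshots else (0, {})
        self.offset = offset
        self.state = copy.deepcopy(state)


stream = ["a", "b", "a", "c", "a", "b"]
p = EpochProcessor(epoch_size=2)
p.run(stream, until=5)            # simulated failure after 5 records, mid-epoch
p.rollback()                      # back to the snapshot taken at offset 4
p.run(stream, until=len(stream))  # the interrupted epoch is repeated
print(p.state)
```

After the rollback and replay, the state equals what a failure-free run over the same input would have produced.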

A Synchronous Approach

[Figure: a master coordinates the pipeline. For each epoch in turn (drain epoch 1, drain epoch 2), the epoch's records are drained and the task states are copied to the Snapshot Store.]

Synchronous Snapshots

• In use: Storm Trident and Spark Streaming

• A conservative approach, equivalent to batching

• Can cause unnecessary latency (master coordination)

• Processing is no longer continuous

• Forces many tasks to be idle

• Instead, in Apache Flink snapshots are pipelined

Pipelined Snapshots

[Figure: a dataflow of tasks A, B, C, D and E next to a Snapshot Store, animated in steps]

• Insert markers into the input streams

• Tasks perform epoch alignment on the markers they receive

• Each task's state goes to the Snapshot Store through an async state copy; the store fills up task by task (B, A, C, D, E)

• The snapshot completes once all task states are in the Snapshot Store
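Epoch alignment on a task with several input channels can be sketched as below. This is a minimal illustration under stated assumptions rather than Flink's implementation: `AligningTask` and the `MARKER` value are hypothetical, the task simply sums its inputs, and the state copy is done inline where a real system would copy asynchronously.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical epoch marker value


class AligningTask:
    """Hypothetical task that sums its inputs and aligns epoch markers."""

    def __init__(self, num_channels):
        self.num_channels = num_channels
        self.total = 0
        self.blocked = set()     # channels whose marker has already arrived
        self.buffered = deque()  # records held back during alignment
        self.snapshots = []      # state captured per epoch
        self.output = []         # records and markers sent downstream

    def receive(self, channel, item):
        if item == MARKER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_channels:
                self._snapshot()
        elif channel in self.blocked:
            # The record belongs to the next epoch: hold it until alignment ends.
            self.buffered.append(item)
        else:
            self._process(item)

    def _process(self, value):
        self.total += value
        self.output.append(self.total)

    def _snapshot(self):
        self.snapshots.append(self.total)  # inline copy; async in a real system
        self.output.append(MARKER)         # forward the marker downstream
        self.blocked.clear()
        while self.buffered:
            self._process(self.buffered.popleft())


task = AligningTask(num_channels=2)
events = [(0, 1), (0, MARKER), (0, 10), (1, 2), (1, MARKER), (1, 20)]
for channel, item in events:
    task.receive(channel, item)
print(task.snapshots, task.total)
```

The captured state reflects exactly the records that preceded the marker on every channel, while records that arrive behind a marker wait until the slower channel catches up.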

Pipelined Snapshots (cycles)

Problem: we cannot wait indefinitely for records in cycles

Solution: log the in-flight records within a cycle in the snapshot, and replay them upon recovery.

Technique Highlights

• Offers exactly-once processing guarantees

• Issued periodically/externally by the user

• Naturally respects flow control mechanisms

• Channel state logging limited to cycles only

• Multiple epoch snapshots can be pipelined

• Can offer weaker at-least-once processing guarantees by simply dropping alignment, in exchange for no alignment cost

Snapshot Usages

1. End-to-End Guarantees

2. Reconfiguration

3. Version Control

4. Isolation

Exactly-Once: Input and Processing

Important Assumptions

• Input streams are persisted with offset indexes (e.g., Kafka, Kinesis)

• Data Channels are FIFO and reliable (no loss)

Each epoch either completes or repeats


Exactly-Once Output

• Idempotency ~ repeated operations can be tolerated after recovery/rollback (works for mutable stores).

• Transactional Processing ~ requires a two-phase coordination. A snapshot completion eventually leads to an external commit (e.g., Flink’s HDFS RollingSink*)

[Figure: output state per epoch. epoch n: in-progress; epoch n-1: pending; epoch n-2: pending; epoch n-3: committed]
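The transactional variant can be sketched as a sink that stages output per epoch and publishes it only after the corresponding snapshot completes. The class `TransactionalSink` and its method names are hypothetical and do not mirror the RollingSink API; the sketch only shows the in-progress, pending and committed stages.

```python
class TransactionalSink:
    """Hypothetical sink that stages output per epoch and commits on snapshot completion."""

    def __init__(self):
        self.in_progress = []  # records of the current epoch
        self.pending = {}      # epoch id -> staged records awaiting commit
        self.committed = []    # records visible to external readers

    def write(self, record):
        self.in_progress.append(record)

    def pre_commit(self, epoch):
        # Phase 1: the epoch ended at this sink; stage its output.
        self.pending[epoch] = self.in_progress
        self.in_progress = []

    def commit(self, epoch):
        # Phase 2: the snapshot for `epoch` completed; publish staged epochs up to it.
        for e in sorted(k for k in self.pending if k <= epoch):
            self.committed.extend(self.pending.pop(e))

    def recover(self, last_completed_epoch):
        # Staged output of completed snapshots is committed; the rest is
        # discarded because those epochs will be repeated.
        self.commit(last_completed_epoch)
        self.pending.clear()
        self.in_progress = []


sink = TransactionalSink()
sink.write("a")
sink.pre_commit(1)
sink.write("b")
sink.pre_commit(2)
sink.commit(1)                        # snapshot 1 completed
sink.write("c")                       # epoch 3 in progress
sink.recover(last_completed_epoch=1)  # failure: epochs 2 and 3 repeat
sink.write("b")
sink.pre_commit(2)
sink.commit(2)
print(sink.committed)
```

Each record reaches the committed output once, even though epoch 2 was processed twice.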


Dataflow Reconfiguration

[Figure: a pipeline with snapshots snap-1, snap-2 and snap-3; the pipeline is stopped and restarted with a changed parallelism]

Problem: How is state repartitioned from a snapshot?

Reconfiguration: The Issue

Snapshot contents in remote storage: 0x100: bob … … … … 0x449: alice

• Case I, full scan: Scan Remote Storage for Responsible Keys (too slow)

• Case II: Include Key Locations in Snapshot Metadata, e.g. bob: 0x100, carol: 0x344, … and alice: 0x449, chuck: 0x630, … (too much metadata)

Reconfiguration: Key Groups

Pre-partition state in hash(K) space, into key-groups

[Figure: keys such as bob and alice placed into key-groups]

• Snapshot Metadata: Contains a reference per stored Key-Group (less metadata)

• Reconfiguration: Contiguous key-group allocation to available tasks (less IO)

Note: the number of key groups controls the trade-off between the metadata to keep and the reconfiguration speed
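A minimal sketch of key-group assignment follows. The hash function (CRC32), the constant `NUM_KEY_GROUPS` and the function names are assumptions made for illustration; the slides only state that keys are pre-partitioned in hash(K) space and that key-groups are allocated to tasks in contiguous ranges.

```python
import zlib

NUM_KEY_GROUPS = 128  # fixed up front; an assumed value for this sketch


def key_group(key, num_key_groups=NUM_KEY_GROUPS):
    """Map a key to a key-group with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_key_groups


def owner(key_group_id, parallelism, num_key_groups=NUM_KEY_GROUPS):
    """Index of the task that owns a key-group at the given parallelism."""
    return key_group_id * parallelism // num_key_groups


def key_group_range(task_index, parallelism, num_key_groups=NUM_KEY_GROUPS):
    """Contiguous range of key-groups owned by one task (ceiling division)."""
    start = (task_index * num_key_groups + parallelism - 1) // parallelism
    end = ((task_index + 1) * num_key_groups + parallelism - 1) // parallelism
    return range(start, end)


for parallelism in (2, 3):
    ranges = [key_group_range(i, parallelism) for i in range(parallelism)]
    print(parallelism, ranges)
    for key in ("bob", "carol", "alice", "chuck"):
        group = key_group(key)
        # The owner computed from the key-group matches the contiguous range.
        assert group in ranges[owner(group, parallelism)]
```

A key's key-group never changes, so rescaling only changes which contiguous range each task reads from the snapshot. With 128 key-groups, parallelism 2 gives the ranges 0 to 63 and 64 to 127.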


Version Control

[Figure: fork and update. Pipeline v.1 is forked into the updated pipelines v.2 and v.3]


Isolation Levels

external query: select from facebook.userID, clients.name … inner join clients on …

• read-committed (snapshot)

• read-uncommitted (dirty read on latest state)
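The two isolation levels can be illustrated with a small sketch in which an external query reads either the last completed snapshot or the latest live state. `QueryableState` and its methods are hypothetical names for this illustration and are not a Flink API.

```python
import copy


class QueryableState:
    """Hypothetical state holder serving external queries at two isolation levels."""

    def __init__(self):
        self.live = {}      # latest state, including updates not yet in a snapshot
        self.snapshot = {}  # state as of the last completed snapshot

    def update(self, key, value):
        self.live[key] = value

    def complete_snapshot(self):
        self.snapshot = copy.deepcopy(self.live)

    def query(self, key, isolation="read-committed"):
        # read-committed reads the snapshot; read-uncommitted reads the live state.
        source = self.snapshot if isolation == "read-committed" else self.live
        return source.get(key)


state = QueryableState()
state.update("bob", 1)
state.complete_snapshot()
state.update("bob", 2)  # not yet covered by a snapshot
print(state.query("bob", "read-committed"), state.query("bob", "read-uncommitted"))
```

A read-committed query keeps returning the value from the last completed snapshot until the next one completes, so it never observes state that a rollback could undo.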

Large Scale Deployment at King

[Figure: Total Snapshotting Time (sec) against Global State Size (GB): total time / snapshot (alignment + async copies), ~runtime overhead]

[Figure: Total Alignment Time (msec) against Parallelism (30, 50, 70) for PROC, WIN and OUT: alignment cost, annotated with #shuffles (keyby) and parallelism]

Teaser: More paper highlights

• We can use the same technique to coordinate externally managed state with snapshots.

• Epoch markers can act as on-the-fly reconfiguration points.

• Internals of asynchronous and incremental snapshots.
