Apache Flink - Overview

http://flink.incubator.apache.org | @ApacheFlink

Apache Flink: Next-Gen Data Analytics Platform

Upload: stephan-ewen

Posted on 25-May-2015


DESCRIPTION

A first overview of the Apache Flink (incubating) system

TRANSCRIPT

Page 1: Apache Flink - Overview

http://flink.incubator.apache.org | @ApacheFlink

Apache Flink: Next-Gen Data Analytics Platform

Page 2: Apache Flink - Overview

What is Apache Flink?

An efficient distributed general-purpose data analysis platform.

Runs on top of HDFS and YARN.

Focusing on ease of programming.

Page 3: Apache Flink - Overview

Project status

• Incubating at the Apache Software Foundation
• Version 0.6 stable, 0.7-SNAPSHOT in development
• Came out of the Stratosphere research project
• Growing community of contributors

Page 4: Apache Flink - Overview

Flink Stack

[Diagram of the stack, top to bottom:]
• APIs: Java API, Scala API, Spargel (graphs), SQL & Python (in development), Hadoop MapReduce, Hive, JDBC, …
• Flink Optimizer
• Flink Runtime
• Storage: HDFS, Local Files, S3
• Cluster Manager: YARN, EC2, Direct

Page 5: Apache Flink - Overview

Key Features

Easy-to-use developer APIs
• Java, Scala, Graphs, Nested Data (Python & SQL under development)
• Flexible composition of large programs

Automatic Optimization
• Join algorithms
• Operator chaining
• Reusing partitioning/sorting

High-Performance Runtime
• Complex DAGs of operators
• In-memory & out-of-core
• Data streamed between operations

Native Iterations
• Embedded in the APIs
• Data streaming / in-memory
• Delta iterations speed up many programs by orders of magnitude

Page 6: Apache Flink - Overview


Flink Overview

Page 7: Apache Flink - Overview

Concise & rich APIs: Word Count, Java API

    DataSet<String> text = env.readTextFile(input);

    DataSet<Tuple2<String, Integer>> result = text
        .flatMap(new Splitter())
        .groupBy(0)
        .aggregate(SUM, 1);

    // flat-map function implementation
    class Splitter extends FlatMapFunction<String, Tuple2<String, Integer>> {
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            for (String token : value.split("\\W")) {
                out.collect(new Tuple2<String, Integer>(token, 1));
            }
        }
    }

Can use regular POJOs!

Page 8: Apache Flink - Overview

Concise & rich APIs: Word Count, Scala API

    val input = TextFile(textInput)
    val words = input flatMap { line => line.split("\\W+") }
    val counts = words groupBy { word => word } count()

Page 9: Apache Flink - Overview

Flexible Data Pipelines

[Diagram: a dataflow DAG where Sources feed Map, Reduce, Join, and Iterate operators in an arbitrary pipeline ending in a Sink.]

Page 10: Apache Flink - Overview

Rich Building Blocks

• Map, FlatMap, MapPartition
• Filter, Project
• Reduce, ReduceGroup, Aggregate, Distinct
• Join
• CoGroup
• Cross
• Iterate
• Iterate Delta
• Graph Vertex-Centric (Pregel style)
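Most of these building blocks have direct analogues in plain Java collections. As an illustrative sketch (using java.util.stream, not Flink's API), here is how FlatMap, Filter, and GroupBy/Aggregate compose into a word count:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OperatorSketch {
    // Word count expressed with flatMap/filter/groupBy-style operators,
    // mirroring the Flink building blocks listed above.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\W+"))) // FlatMap
                .filter(w -> !w.isEmpty())                          // Filter
                .collect(Collectors.groupingBy(                     // GroupBy + Aggregate
                        w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(Arrays.asList("to be", "or not to be"));
        System.out.println(counts.get("to")); // 2
    }
}
```

Flink's versions of these operators run distributed and out-of-core, but the composition style is the same.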

Page 11: Apache Flink - Overview

Example: Joins in Flink

    DataSet<Tuple...> large  = env.readCsv(...);
    DataSet<Tuple...> medium = env.readCsv(...);
    DataSet<Tuple...> small  = env.readCsv(...);

    DataSet<Tuple...> joined1 = large.join(medium)
        .where(3).equalTo(1)
        .with(new JoinFunction() { ... });

    DataSet<Tuple...> joined2 = small.join(joined1)
        .where(0).equalTo(2)
        .with(new JoinFunction() { ... });

    DataSet<Tuple...> result = joined2.groupBy(3).aggregate(MAX, 2);

Built-in strategies include partitioned join and replicated join with local sort-merge or hybrid-hash algorithms.

[Diagram: join tree joining large with medium, then with small, followed by a grouping/aggregation (γ).]

Page 12: Apache Flink - Overview

Automatic Optimization

    DataSet<Tuple...> large  = env.readCsv(...);
    DataSet<Tuple...> medium = env.readCsv(...);
    DataSet<Tuple...> small  = env.readCsv(...);

    DataSet<Tuple...> joined1 = large.join(medium)
        .where(3).equalTo(1)
        .with(new JoinFunction() { ... });

    DataSet<Tuple...> joined2 = small.join(joined1)
        .where(0).equalTo(2)
        .with(new JoinFunction() { ... });

    DataSet<Tuple...> result = joined2.groupBy(3).aggregate(MAX, 2);

Possible execution:
1) Partitioned hash-join
2) Broadcast hash-join
3) Grouping/aggregation reuses the partitioning from step (1): no shuffle!

Partitioned ≈ reduce-side join; Broadcast ≈ map-side join.
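The broadcast (map-side) hash-join strategy can be sketched in plain Java (an illustration, not Flink's runtime code): the small input is broadcast into an in-memory hash table, and the large input is streamed against it, so the large side never needs to be shuffled:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BroadcastHashJoin {
    // Joins large records (key, value) against small records (key, label).
    // Build phase: hash the broadcast (small) side.
    // Probe phase: stream the large side and look each key up.
    static List<String> join(List<String[]> large, List<String[]> small) {
        Map<String, String> table = new HashMap<>();
        for (String[] s : small) {
            table.put(s[0], s[1]); // build: small side fits in memory
        }
        List<String> out = new ArrayList<>();
        for (String[] l : large) {
            String match = table.get(l[0]); // probe: one lookup per large record
            if (match != null) {
                out.add(l[0] + ":" + l[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> large = List.of(new String[]{"a", "1"}, new String[]{"b", "2"});
        List<String[]> small = List.of(new String[]{"a", "x"});
        System.out.println(join(large, small)); // [a:1,x]
    }
}
```

A partitioned (reduce-side) join would instead shuffle both inputs by key so matching records meet on the same node; the optimizer picks between the two based on input sizes.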

Page 13: Apache Flink - Overview

Running Programs

    > bin/flink run prg.jar

Three ways to run a program:
• Packaged programs: submit a program JAR file to the master
• RemoteEnvironment.execute(): ships the program to a remote master via RPC & serialization
• LocalEnvironment.execute(): spawns an embedded multi-threaded environment in the local JVM

Page 14: Apache Flink - Overview


Flink Runtime

Page 15: Apache Flink - Overview

Distributed Runtime

• Master (Job Manager): handles job submission, scheduling, and metadata
• Workers (Task Managers): execute operations
• Data can be streamed between nodes
• All operators start in-memory and gradually go out-of-core

Page 16: Apache Flink - Overview

Runtime Architecture (comparison)

    public class WC {
        public String word;
        public int count;
    }

Distributed collection approach (List[WC]):
• Collections of objects
• General-purpose serializer (Java / Kryo)
• Limited control over memory & less efficient spilling
• Deserialize all or nothing

Flink approach (pool of memory pages):
• Works on pages of bytes
• Maps objects transparently to these pages
• Full control over memory, out-of-core enabled
• Algorithms work on the binary representation
• Address individual fields (no need to deserialize the whole object)
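The binary-representation idea can be illustrated with a plain-Java sketch (this is not Flink's actual MemorySegment code): records are written into a fixed-layout byte page, and a single field is read back by offset without deserializing any object:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BinaryPageSketch {
    // Fixed-layout record for the WC type above:
    // 8 bytes for the word (padded) followed by a 4-byte int count.
    static final int WORD_LEN = 8;
    static final int RECORD_LEN = WORD_LEN + 4;

    static void writeRecord(ByteBuffer page, int slot, String word, int count) {
        byte[] w = new byte[WORD_LEN];
        byte[] src = word.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, w, 0, Math.min(src.length, WORD_LEN));
        page.position(slot * RECORD_LEN);
        page.put(w);
        page.putInt(count);
    }

    // Reads just the count field of one record: no object is materialized,
    // which is the "address individual fields" property described above.
    static int readCount(ByteBuffer page, int slot) {
        return page.getInt(slot * RECORD_LEN + WORD_LEN);
    }

    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.allocate(1024); // one "memory page"
        writeRecord(page, 0, "flink", 3);
        writeRecord(page, 1, "hadoop", 7);
        System.out.println(readCount(page, 1)); // 7
    }
}
```

Because pages are just byte arrays, spilling a page to disk or comparing keys byte-wise needs no serialization round-trip.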

Page 17: Apache Flink - Overview


Iterative Programs

Page 18: Apache Flink - Overview

Why Iterative Algorithms?

• Algorithms that need iterations:
  o Clustering (K-Means, Canopy, …)
  o Gradient descent (e.g., Logistic Regression, Matrix Factorization)
  o Graph algorithms (e.g., PageRank, LineRank, components, paths, reachability, centrality, …)
  o Graph communities / dense sub-components
  o Inference (belief propagation)
  o …
• The loop makes multiple passes over the data

Page 19: Apache Flink - Overview

Iterations in other systems

[Diagram: Step → Step → Step → Step → Step, driven by the client. The loop lives outside the system.]

Page 20: Apache Flink - Overview

Iterations in Flink

Streaming dataflow with feedback

[Diagram: map, join, reduce, and join operators with a feedback edge.]

The system is iteration-aware and performs automatic optimization.

Page 21: Apache Flink - Overview

Automatic Optimization for Iterative Programs

• Caching loop-invariant data
• Pushing work "out of the loop"
• Maintaining state as an index

Page 22: Apache Flink - Overview

Unifies various kinds of computations

Pregel/Giraph-style graph computation:

    ExecutionEnvironment env = getExecutionEnvironment();

    DataSet<Long> vertexIds = ...
    DataSet<Tuple2<Long, Long>> edges = ...

    DataSet<Tuple2<Long, Long>> vertices = vertexIds.map(new IdAssigner());

    DataSet<Tuple2<Long, Long>> result = vertices
        .runOperation(VertexCentricIteration.withPlainEdges(
            edges, new CCUpdater(), new CCMessager(), 100));

    result.print();
    env.execute("Connected Components");

Page 23: Apache Flink - Overview

Delta iterations speed up certain problems by a lot

[Chart: number of vertices (thousands) processed per iteration, Bulk vs. Delta, and runtime in seconds on the Twitter and Webbase graphs, for connected communities of a social graph.]

Cover typical use cases of Pregel-like systems with comparable performance in a generic platform and developer API.
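The delta idea can be sketched in plain Java (an illustration, not Flink's delta-iteration API): instead of re-scanning every vertex in every pass, only vertices whose label changed in the previous pass (the workset) do work in the next one:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class DeltaIterationSketch {
    // Connected components driven by a workset of changed vertices:
    // stable parts of the graph are never revisited, which is why delta
    // iterations finish much faster on graphs that converge unevenly.
    static int[] run(int numVertices, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int v = 0; v < numVertices; v++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }
        int[] label = new int[numVertices];
        Deque<Integer> workset = new ArrayDeque<>();
        for (int v = 0; v < numVertices; v++) {
            label[v] = v;
            workset.add(v);          // initially every vertex counts as "changed"
        }
        while (!workset.isEmpty()) { // only changed vertices propagate
            int v = workset.poll();
            for (int n : adj.get(v)) {
                if (label[n] > label[v]) {
                    label[n] = label[v];
                    workset.add(n);  // n changed, so it joins the next workset
                }
            }
        }
        return label;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {3, 4}};
        System.out.println(Arrays.toString(run(5, edges))); // [0, 0, 0, 3, 3]
    }
}
```

A bulk iteration would scan all vertices each round; here the workset shrinks as components stabilize, mirroring the shrinking per-iteration vertex counts in the chart above.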

Page 24: Apache Flink - Overview


Program Optimization

Page 25: Apache Flink - Overview

Why Program Optimization?

Do you want to hand-optimize that?

Page 26: Apache Flink - Overview

What is Automatic Optimization?

Run on a sample on the laptop, on large files on the cluster, or a month later after the data has evolved: each setting yields a different execution plan (Plan A, B, C).

The optimizer chooses: hash vs. sort, partition vs. broadcast, caching, reusing partitioning/sorting.

Page 27: Apache Flink - Overview


Using Flink

Page 28: Apache Flink - Overview


http://flink.incubator.apache.org

Page 29: Apache Flink - Overview

It's easy to get started…

Trying it out: run a local pseudo-cluster (also available as a Debian package):

    $ wget https://.../flink-0.6-incubating.tgz
    $ tar xzf flink-*.tgz
    $ flink/bin/start-local.sh

For the experts: if you have YARN, deploy a full Flink setup in 3 commands (also works on Amazon Elastic MapReduce ;-)):

    wget http://www.apache.org/dyn/closer.cgi/incubator/flink/flink-0.6-incubating-bin-hadoop2-yarn.tgz
    tar xvzf flink-0.6-incubating-bin-hadoop2-yarn.tgz
    ./flink-0.6-incubating/bin/yarn-session.sh -n 4 -jm 1024 -tm 3000

Quickstart projects set up a program skeleton, including an embedded local execution/debugging environment…

Page 30: Apache Flink - Overview


Tutorial Example

Page 31: Apache Flink - Overview

Where to find us

flink.incubator.apache.org
github.com/apache/incubator-flink
@ApacheFlink