
Degree Project in Information and Communication Technology, Second Cycle, 30 Credits
Stockholm, Sweden 2019

Hardware Utilisation Techniques for Data Stream Processing

MAX MELDRUM

KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science

Abstract

    Recent years have seen an increase in use of the stream processing architecture to

    compose continuous analytics applications. This thesis presents the design of a

    Rust-based stream processor that adopts two separate techniques to tackle existing

    weaknesses in modern production-grade stream processors. The first technique

    employs a data analytics language on top of the streaming runtime, in order

    to provide both dataflow and low-level compiler optimisations. This technique

    is motivated by an analysis of the impact that the lack of compiler integration

    may have on the end-to-end performance of streaming pipelines in Apache Flink.

    In the second technique streaming operators are scheduled using a task-parallel

    approach to boost performance for skewed data distributions. The experimental

results for data-parallel streaming pipelines in this thesis demonstrate that

    the scheduling model of the prototype achieves performance improvements in

    skewed scenarios without exhibiting any significant losses in performance during

    uniform distributions.

    Keywords

    Data Analytics, Data Stream Processing, Distributed Systems


Abstract (Swedish)

In recent years, the use of the stream processing architecture to compose continuous analytics applications has increased. This thesis presents the design of a Rust-based stream processor that uses two separate techniques to address existing weaknesses in modern stream processors. The first technique employs a data analytics language on top of the stream processor in order to provide both dataflow and low-level compiler optimisations. This technique is motivated by an analysis of the impact that the lack of compiler integration may have on the end performance of analytics applications in Apache Flink. In the second technique, streaming operators are scheduled using a task-parallel approach to increase performance for skewed data distributions. The experimental results for data-parallel analytics applications in this thesis show that the prototype's scheduling model achieves performance improvements for skewed distributions without exhibiting any significant performance losses under uniform distributions.


Acknowledgements

    I would like to start by thanking Prof. Seif Haridi for giving me the opportunity to

    work on the Continuous Deep Analytics (CDA) project. I have had the pleasure of

    working under the supervision of excellent researchers during my thesis journey.

    I would like to express my appreciation to my supervisors, Dr. Paris Carbone and

    Lars Kroll, for their continuous guidance, support, and patience. I would also

    like to thank Adam Hasselberg for helping me out with the implementation in

    this thesis, but also for our interesting discussions in front of the whiteboard. In

    addition, I would like to thank the rest of the CDA team for their feedback, this

    includes Klas Segeljakt, Prof. Christian Schulte, and Dr. Theodore Vasiloudis.

    Finally, I would like to thank my classmates, Nikhil and Óttar, for our bi-weekly

    thesis support group.


Authors

Max Meldrum
Information and Communication Technology
KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden
RISE SICS

Examiner

Prof. Seif Haridi
KTH Royal Institute of Technology

Supervisors

Lars Kroll
KTH Royal Institute of Technology

Dr. Paris Carbone
Research Institutes of Sweden

Contents

1 Introduction
1.1 Background
1.2 Contributions
1.3 Methodology
1.4 Ethics and Sustainability
1.5 Delimitations

2 Theoretical Background
2.1 Stream Processing
2.2 Event-based Middleware

3 Motivation
3.1 Hardware Acceleration through Computation Sharing
3.2 Execution Goals

4 Design
4.1 System Model
4.2 System Overview

5 Results
5.1 Experimental Setup
5.2 Experiment Context
5.3 Benchmark Description
5.4 Results
5.5 Analysis

6 Conclusions
6.1 Future Work

References

1 Introduction

Batch processing systems [6, 18] have enabled powerful offline processing of data. However, this class of systems was not designed for low-latency processing of continuously incoming data. Therefore, a paradigm shift was imperative. Stream

    processing systems ingest unbounded data and can process data in an online

    fashion. Single node stream processors [16, 25, 34] focus on achieving high

    performance on a single machine, whereas distributed stream processors (e.g.,

    Apache Flink [11], Heron [28]) scale out over a cluster of commodity machines,

    thus enabling fault tolerance and data-parallelism. Organisations use stream

    processors to run never-ending services to drive their business-critical analytics.

    Therefore, it is crucial that the systems are able to: (1) handle machine and

    software failures; (2) execute efficiently on modern hardware; (3) adapt to volatile

    workload skew.

Production-grade distributed stream processors [8, 11, 28, 37] adhere to the

    Dataflow/Beam model [2]. One common denominator of all these systems is that

    they run on the Java Virtual Machine (JVM), which means inheriting limitations

    such as runtime overhead due to garbage collection and failing to fully exploit

    modern hardware as observed by Zeuch et al. [45]. With the growing amount

    of data being produced daily, existing and future stream processors must be able

    to process data at even greater speeds while still being resource-efficient. With

    that serving as motivation, this work aims to investigate whether the following

    techniques provide sufficient improvements over the state of the art:

    Adoption of a more flexible thread model. The aforementioned systems typically

    use a dedicated thread per streaming operator model. Although this allows for

high performance, it is also restrictive: higher levels of parallelism are only possible through a more fine-grained scheduling model. To that end, the aim is to

    investigate if it is possible to achieve better hardware utilisation while scheduling

streaming operators on a work-stealing threadpool, in which they would be active only while they have work.

    Using a data analytics language. A shortcoming of existing distributed stream

    processing systems is the inability to provide both dataflow and ahead-of-time

compiler optimisations. Supporting the latter is not possible while treating user-

    defined functions in the framework as black boxes. To overcome this problem, an

    intermediate representation (IR) can be utilised on top of a stream processor. The

    IR would make it possible to optimise streaming dataflows more aggressively.

However, this comes at the cost of making the system more restrictive, i.e. limiting what user code may be accepted.

    These techniques have been incorporated into a distributed data stream processor

written in Rust and are evaluated in this thesis against Apache Flink, a state-of-

    the-art distributed stream processor.

    1.1 Background

    This section starts by describing the dataflow processing abstraction that is used in

    the majority of data processing systems. This is followed by a brief introduction of

    the Rust programming language together with a motivation of why the language

    is suited for implementing a high performance data stream runtime. Finally, the

    last section will introduce the Arcon system into which the streaming runtime in

    this thesis is integrated.

    1.1.1 Dataflow Processing

    Distributed data processing frameworks [11, 41, 43] express applications as a

Directed Acyclic Graph (DAG): G = (V, E), where vertices (V) represent the computational tasks and edges (E) the data channels between tasks. Dataflow

    graphs provide the necessary processing abstractions needed in both stream

    and batch processing. Figure 1.1 depicts a streaming dataflow, where each task

    manipulates incoming records and pushes the output along the graph. Source

    and Sink tasks are a special form of task. The former acts as an entry point to the

    dataflow graph, while the latter is responsible for taking the transformed data and

    publishing it to an external endpoint (e.g., database). Normal tasks have both an

    input and an output channel. On the other hand, sources only have one output

    channel and sinks only have one input channel.

Figure 1.1: Streaming Dataflow (two Source tasks feeding two Filter tasks into a single Sink)

    1.1.2 Rust

    Rust is a systems programming language that primarily focuses on safety and

    performance. Ownership semantics are incorporated into the language which

    enables it to guarantee memory and thread safety at compile time. The ownership

    model solves the issue of heap memory management. This means Rust does

    not suffer from any runtime overhead due to garbage collection. Since it runs

on bare metal and is built on LLVM, Rust is able to deliver runtime performance similar to that of C/C++. The qualities above make Rust a suitable choice for a

    stream processor. Streaming applications may run for weeks, months, or even

    years. Thus, longer compilation times are affordable if they result in highly

    efficient application binaries.
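To make the ownership model concrete, the following minimal sketch (illustrative only, not taken from Arcon) shows how moving a value invalidates its previous binding at compile time, removing any need for a garbage collector:

fn consume(v: Vec<i32>) -> usize {
    v.len()
} // `v` is dropped here and its heap memory is freed deterministically

fn main() {
    let data = vec![1, 2, 3];
    let n = consume(data); // ownership of `data` moves into `consume`
    // println!("{:?}", data); // would not compile: `data` was moved
    println!("{}", n);
}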

    1.1.3 The Arcon System

    Modern data processing frameworks are typically optimised for a specific

    workload. Tensorflow [30] is tailored for machine learning programs. Ray

    [35] targets AI applications in general and in particular Reinforcement Learning.

    Spark [44] excels at relational operations and batch processing at scale. Flink [11]

    is a distributed stream processor predominantly used for large-scale real-time

    analytics. Giraph [21] is a graph processing framework that is exclusively utilised

for analysing massive graphs. The drawback of specialised systems becomes

    apparent when one needs to combine many diverse workloads in an end-to-end

    data analytics pipeline. Not only does the complexity increase (e.g., because each

    framework has different processing guarantees), but it becomes harder to perform

optimisations across frameworks. Lastly, there are performance penalties due to

    the data materialisation that occurs between frameworks. The Arcon [33] project

aims to tackle this problem and design a new generation of data programming systems for building continuous analytics applications. Arcon is made up of two

    parts: 1) Arc [27], a language for batch-and-stream programming; 2) the Arcon

    runtime, a distributed dataflow runtime based on stream processing that executes

    applications constructed through Arc.

    Figure 1.2 depicts the Arcon stack. At the top are examples of workloads that

    Arcon aims to support, either through Arcon’s own language frontend or by

    providing direct translations of libraries (e.g., Tensorflow [30], Beam [2], Pandas

    [32]) to Arc.

Figure 1.2: System Overview (the Arcon stack: workloads such as event-based microservices, stream ML, SQL, and dynamic graphs on top of the Arc IR/compiler, with the Arcon runtime split into an operational plane and an execution plane)

1.1.3.1 The Arc IR

Arc extends the Weld [38] language, which targets batch

    computation, to also include streaming semantics. The main motivation behind

    Weld is that modern data analytic workflows tend to combine functions from

multiple libraries. This usually results in inefficient execution due to isolated, per-library execution and a lack of optimisation across libraries. Weld tackles this by lazily

    building up the computation for the full execution plan and evaluating the result

    only when required. In addition to existing features in Weld, Arc adds constructs

    that enable the language to form streaming pipelines composed of operators

    connected by channels. Arc seamlessly enables the end user to take advantage

    of both batch-and-stream processing in the same application. During Arc’s

compilation process, a plethora of dataflow optimisations [24] may be applied.

    Listing 1 shows a simple Arc function that takes two signed integers and adds

    them together.

|a: i32, b: i32| a + b

Listing 1: Addition function

    Arc Data Types. There are two categories of types in Arc, value types and builder

    types. Value types are read-only and are meant to hold data. Down below are

    some examples of value types:

    • Scalars: bool, i8, u8, i16, u16, i32, u32, i64, u64, f32, f64

    • Vectors: Vec[T]

    • Structs: {T1, T2, ...}

    • Dictionaries: Dict[K, V]

    • Streams: Stream[T]

    Builder types are write-only types that are used for parallel constructs. Elements

    can be merged into a builder by using the merge operation. It may also be

    materialised into a value type by calling result on it. Down below are some

    examples of commonly used Arc builders and their respective output value

    types:

    • Appender[T]: Vec[T]

    • Merger[T, binop]: T

    • StreamAppender[T]: Stream[T]

1.1.3.2 Examples in Arc

Listing 2 shows an example of the Appender[T] builder.

    It is used to append values to a vector of type T. On line 4, the appender is

    materialised into a Vec[i32].

1 let app = Appender[i32];

    2 let app = merge(app, 10);

    3 let app = merge(app, 20);

    4 result(app) // [10, 20]

Listing 2: Appender Builder

    The Merger[T, binop] builder takes a type T and a commutative binary operation

    (e.g., +, *). Listing 3 showcases an example using multiplication. On line 1, the

    merger is initialised with the value 1, whereas if it had been an addition merger

    (i.e. Merger[T, +]), it would have been set to 0.

    1 let m = Merger[i32, *];

    2 let m = merge(m, 3);

    3 let m = merge(m, 5);

    4 result(m) // 15

Listing 3: Merger Builder

    Listing 4 depicts a bounded (i.e. batch) map function in Arc. The function takes

    a Vec[i32] which it then iterates over. For each iteration, it merges the current

    element + 3 into an Appender builder. At the end of the for-loop, the appender is

    returned and then materialised back into a Vec[i32] through the result operation.

|data: Vec[i32]| result(for(data, Appender[i32],
    |app, element, index| merge(app, element + 3)))

Listing 4: Arc Map operation

    This next example in listing 5 shows a simple streaming pipeline using Arc. The

    function accepts Stream[u32] and StreamAppender[u32] as arguments. The for-loop

    on line 2 represents a streaming operator that receives inputs from a queue. Each

    element is multiplied by 10 and then merged onto the outgoing channel by using

    the StreamAppender builder. Figure 1.3 depicts the dataflow graph of this small

    piece of Arc code.

1 |source: Stream[u32], sink: StreamAppender[u32]|

    2 for (source, sink, | sink, element |

    3 merge(sink, element * 10))

Listing 5: Arc Example

Figure 1.3: Dataflow graph of listing 5 (Source → Map → Sink)

    Although the streaming pipeline in listing 5 is trivial, a range of advanced

    pipelines may be expressed using the Arc IR.

1.1.3.3 Arcon Runtime

At the heart of the Arcon system is the streaming-based

    distributed runtime. The runtime is split into two areas of concern: Operability

    and Execution.

    The Operational Plane: The operational plane orchestrates the deployment of

    applications on top of Arcon. It is responsible for coordinating the distributed

    execution at a high-level. This includes application monitoring and management

    of state. The operational plane is implemented in Scala using the actor framework

    Akka [3]. This part of the runtime does not perform any heavy computation.

    Therefore, Scala is chosen, not for its performance, but for its expressiveness and

    interoperability with the rich ecosystem of the JVM (e.g., HDFS [40], YARN [42]).

    The Execution Plane: The execution plane contains the core execution engine

    of the system. It is responsible for the critical path of the data processing.

    Streaming dataflows are typically deployed as static task graphs, which means

that computations (e.g., iterative, recursive) on top of these dataflow engines tend to be static. The limitations of this approach are twofold. First, performance-heavy algorithms may contend for the resources of the main streaming logic. Second, since the execution plan is static, it is not able to adapt if the available infrastructure changes (e.g., new machines are added). Arcon aims to combine the long-running dataflow model for continuous processing with a dynamic task model, where the latter focuses on batch computations that can be dynamically scheduled during execution.

    1.2 Contributions

This degree project provides three core contributions: two techniques for improving hardware utilisation in data stream processors and an architectural overview of Arcon's streaming runtime.

T1. The first technique employs a data

    analytics IR in order to provide both dataflow and instruction-level optimisations.

    The thesis includes an analysis of how existing production-grade distributed

    stream processors could enhance the performance of end-to-end pipelines by

    adopting T1.

T2. In the second technique, streaming operators are scheduled using a work-

    stealing threadpool in contrast to the conventional dedicated thread model. T2

    is evaluated under a data-parallel pipeline to study how it fares against a state-

    of-the-art (Apache Flink) stream processor that is using the dedicated thread

    approach.

    1.3 Methodology

The degree project uses a descriptive method to review the current state of the art in distributed stream processing and an experimental approach to perform

    a quantitative performance analysis on the two proposed hardware utilisation

    techniques.

    1.4 Ethics and Sustainability

The system described in this thesis is a research prototype. Therefore, there are no proper security measures in place (e.g., data encryption). Data analytics

    frameworks may be used with malicious intent and while implementing

encryption mechanisms for the system is straightforward, it is not feasible to

    control what its users do. Sustainability is another important aspect while

    designing new systems, thus the algorithms that are applied should be energy

    efficient.

    1.5 Delimitations

    The ecosystem of the Rust programming language is still evolving. For that

    reason, the implementation in this thesis is limited to a smaller set of stable

    libraries. However, this is not an issue as the system in this degree project is still

    a research prototype.

2 Theoretical Background

    In this chapter, the necessary background knowledge required for the remainder of the

    thesis is covered.

    2.1 Stream Processing

    Stream processors are typically categorised into either a continuous or micro-

    batch model. Continuous operator based streaming systems [11, 36, 41] express

    applications as a DAG consisting of long-running operators, where records are

    processed as they arrive. Fault-tolerance in this particular model is commonly

    achieved by utilising a checkpointing mechanism [12, 17]. This streaming model

    enables low-latency processing, although, at the cost of expensive fault recovery

    when the state is large. Micro-batch based streaming systems [15, 22, 43] use the

    same DAG abstraction, but gather streaming data into micro-batches, which are

    then processed using the Bulk Synchronous Parallel (BSP) model. Computations

    in this model are deterministic, which allows for efficient fault recovery through

    lineage tracking and techniques like parallel recovery [43]. Although the model

is able to achieve high throughput, it suffers from downsides such as high latency due to task scheduling and the need for a centralised driver program. Throughout

    the rest of this report, the continuous operator model will be assumed.

Event-time vs. Progress time. Stream processors typically support different notions of time. Event-time refers to the time at which an event was created, e.g., the time an external device produced a record, whereas progress time refers to the system time of the current machine.

    Watermarks. Events may be ingested in parallel from multiple sources. As

    a result, the events may arrive out-of-order. The watermark notion is used to

    measure progress in a stream processor that utilises event-time. A watermark

is a message that flows through the dataflow graph. It contains a timestamp t, indicating the current progress: once a watermark with timestamp t has been received, there should no longer be any elements in the stream with a timestamp lower than t.
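As an illustration of this rule, the sketch below (an assumed implementation, not Arcon's actual code) tracks the last watermark per input channel of an operator and only advances the operator's own watermark to the minimum across all inputs:

struct WatermarkTracker {
    per_input: Vec<u64>, // last watermark timestamp seen on each input channel
}

impl WatermarkTracker {
    fn new(inputs: usize) -> Self {
        Self { per_input: vec![0; inputs] }
    }

    // Record a watermark from one input; return the new global watermark
    // (the minimum across all inputs) if it advanced.
    fn update(&mut self, input: usize, t: u64) -> Option<u64> {
        let old_min = *self.per_input.iter().min().unwrap();
        self.per_input[input] = self.per_input[input].max(t);
        let new_min = *self.per_input.iter().min().unwrap();
        if new_min > old_min { Some(new_min) } else { None }
    }
}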

2.1.1 Apache Flink

    Apache Flink [11] is a distributed stream processing system which adheres to

    the continuous operator model. Flink runs on the JVM (Java Virtual Machine)

    and unifies batch processing and stream processing under a single engine. The

unification is achieved by adopting the Dataflow model [2], in which batch data

    is treated as a finite stream. Flink handles out-of-order processing [29], provides

    exactly-once semantics [10, 13], and integrates seamlessly with existing cluster

    managers such as YARN [42] and Kubernetes [23].

    Flink Application: Flink uses a declarative API enabling users to define their

    applications in a concise way. Listing 6 showcases a simple Flink program using

    the programming language Scala. The stream object on line 2 represents a stream

    of unbounded integers. In this example, two common stream transformations are

    applied. First, negative numbers are filtered away. Then, each positive integer is

    multiplied by 5. Finally, the print method on line 5 instructs Flink to print out the

    result. A stream transformation occurs when a stream operator transforms the

    current stream into a new one, with potentially a new data type.

    1 val env = StreamExecutionEnvironment.getExecutionEnvironment

    2 val stream: DataStream[Int] = env.addSource(...)

    3 stream.filter(i => i > 0)

    4 .map(x => x * 5)

    5 .print()

    6

    7 env.execute("Flink Example")

Listing 6: Flink Application

    The above application is turned into a dataflow graph as seen in figure 2.1.

User-Defined Functions: Flink supports a variety of different stream transformations [5]. Flink also supports creating custom UDFs in case the provided transformations are not sufficient.

Figure 2.1: Dataflow Graph of listing 6 (Source → Filter → Map → Print → Sink)

Figure 2.2: Optimised Dataflow Graph (Filter, Map, and Print fused into one operator chain)

Each transformation, or so-called operator, executes inside a Flink task. Each task runs in its own thread. It is possible in Flink to fuse operators into a single task, in which each operator becomes a subtask of an operator chain. Figure 2.2 shows an optimised version of figure 2.1: the filter and map operators are now chained inside one task, thus reducing the overhead of running them across different threads.

    Thread Model: Flink runs a dedicated thread for each task. In short, each task

    runs an infinite while loop, where it tries to consume messages from upstream

    tasks. This approach enables high performance, but at the cost of idle resources if

    the task is not continuously fed data.
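The essence of this model can be sketched as follows (a simplification in Rust, not Flink's actual implementation):

use std::sync::mpsc::Receiver;
use std::thread;

// One dedicated thread per task: the thread loops forever, blocking on its
// input queue, and therefore sits idle whenever no data arrives.
fn spawn_task(input: Receiver<i64>) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        match input.recv() {
            Ok(record) => {
                let _output = record * 5; // apply the operator's UDF
            }
            Err(_) => break, // upstream channel closed; shut the task down
        }
    })
}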

Memory Management: Flink takes explicit control of memory management. It does so by pre-allocating memory segments into which it serialises objects. In

    order to avoid the cost of serialisation and deserialisation, Flink tries to operate on

    binary data as much as possible. As the data is in bytes, spilling to disk becomes

trivial. Using this approach, Flink is able to avoid an OutOfMemoryError that would shut down the JVM. Also, the pressure on the garbage collector is reduced as

    Flink avoids putting large objects on the heap.
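The segment idea can be illustrated with the following conceptual sketch (the types are illustrative; Flink's managed memory is considerably more involved):

// Records are serialised into pre-allocated byte buffers, so the heap never
// holds large object graphs and spilling a segment to disk is a plain byte copy.
struct MemorySegment {
    buf: Box<[u8]>,
    used: usize,
}

impl MemorySegment {
    fn new(capacity: usize) -> Self {
        Self { buf: vec![0u8; capacity].into_boxed_slice(), used: 0 }
    }

    // Append already-serialised bytes; `false` signals that the segment is
    // full and should be flushed or spilled.
    fn write(&mut self, bytes: &[u8]) -> bool {
        if self.used + bytes.len() > self.buf.len() {
            return false;
        }
        self.buf[self.used..self.used + bytes.len()].copy_from_slice(bytes);
        self.used += bytes.len();
        true
    }
}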

2.2 Event-based Middleware

    Kompact [26] is an event-based middleware that combines the Component

Model of Kompics [7] together with the Actor Model [1]. An introduction to the aforementioned models will first be given before delving further into the Kompact framework.

    2.2.1 Actor Model

    The Actor Model is a computational model based on asynchronous message-

    passing. Actors encapsulate their own state and interact with other actors by

    sending messages. Messages are stored in a mailbox that is tied to each actor.

Actors process messages from the mailbox in FIFO order. Figure 2.3 showcases a simple scenario where Actor A sends messages to Actor B's mailbox.

Figure 2.3: Actors (Actor A sending messages to Actor B's mailbox, processed in FIFO order)

    In response to a message, an actor may take the following actions:

    • send messages to other actors

    • send a response

    • change its state

    • create more actors

    • terminate itself

By using the actor model, application complexity decreases as there is no need for explicit thread synchronisation (e.g., mutexes/locks). There are many notable

    actor systems [4, 9] out there today, many of which are being used in production

    in large companies.
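A minimal actor can be sketched in Rust (illustrative only, not one of the cited actor systems): the state is owned by a single thread and a channel serves as the FIFO mailbox, so no locks are required:

use std::sync::mpsc::{channel, Sender};
use std::thread;

enum Msg {
    Add(i64),         // changes the actor's state
    Get(Sender<i64>), // asks for the state; carries a reply channel
}

fn spawn_counter() -> Sender<Msg> {
    let (tx, rx) = channel::<Msg>();
    thread::spawn(move || {
        let mut count: i64 = 0; // state encapsulated by the actor
        for msg in rx {         // mailbox drained in FIFO order
            match msg {
                Msg::Add(n) => count += n,
                Msg::Get(reply) => { let _ = reply.send(count); } // respond
            }
        }
    });
    tx
}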

2.2.2 Component Model

    Similar to the Actor model, the component model is an event-based asynchronous

message-passing model that is suited for building distributed systems. The model

    consists of a set of components that execute concurrently and interact through

    statically-typed ports that are connected by FIFO ordered channels.

Figure 2.4: Component Model

    Component. A component has its own internal state which can only be accessed

    through interaction with the component. This interaction occurs through the Port

    interface.

    Ports. A port supports bidirectional communication. Ports use the notion of

    Request and Indication to express the direction of events. Components either

    provide or require a port. As seen in figure 2.5, Request represents an incoming

event on this port. Thus, a component that provides this port must implement Provide, in which it specifies how the message is to be handled, whereas a Require implementation is needed to handle the Indication of the port.

Figure 2.5: Port (with a Request and an Indication direction)
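The direction of events on a port can be sketched with Rust traits as follows (a loose analogy for illustration, not the Kompics or Kompact API):

// A port pairs two statically typed event directions.
trait Port {
    type Request;    // flows towards the provider of the port
    type Indication; // flows towards the requirer of the port
}

struct PingPort;
impl Port for PingPort {
    type Request = ();     // "ping"
    type Indication = u64; // "pong", carrying a counter value
}

// A component providing a port must specify how Requests are handled;
// a requirer would symmetrically handle Indications.
trait Provide<P: Port> {
    fn handle(&mut self, request: P::Request) -> P::Indication;
}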

    2.2.3 Kompact

    Kompact empowers system designers with the option of composing applications

    using the component model, actor model, or a combination of them both. For

applications that need to spawn workers during execution, the actor model is an excellent approach as it is more suitable for a dynamic environment.

    Whereas if the application setup is predefined, using the component model may

    be a more fitting choice. Depending on the application, it may also make sense to

    combine the two models.

Kompact System. The Kompact System provides an execution runtime for components. This includes scheduling, timers, and networking. The internal

    scheduler aims to ensure fairness between the components running on the

    Kompact System. Components are typically scheduled on top of a threadpool,

    but it is also possible to plug in custom schedulers. It is also possible to tweak

    other settings such as throughput, message priority, and number of threads used

    by the scheduler.

    Kompact Component. A Kompact component will always execute sequentially

    in a single-threaded fashion and may be scheduled onto different threads over its

    lifetime. Figure 2.6 depicts how Kompact components may be used to structure

    applications.

    Kompact Scheduling. Whenever a Kompact component is scheduled, the inner

    execute function of the component will attempt to process events in the following

order: (1) control events (e.g., stop/kill component); (2) timers; (3) messages (i.e. user-defined messages).

Figure 2.6: Kompact Overview

    Kompact Networking. Components in Kompact may communicate over the

    network by using the ActorPath object. Kompact currently only supports the TCP

protocol; however, it is possible to extend it to support others (e.g., UDT/QUIC).

    All connections to a remote Kompact System are multiplexed over a single

TCP channel. The underlying network implementation utilises the Tokio library (https://tokio.rs), which is the most popular network framework for systems in the Rust ecosystem.

3 Motivation

    This chapter is split into two sections. The first section will demonstrate through

    a contrived experiment the need for stronger compiler integration in data stream

    processing systems. The experiment aims to serve as a motivation for the

    execution goals described in the ensuing section.

    3.1 Hardware Acceleration through Computation Sharing

We have observed that common JVM-based stream processors such as Apache

    Flink fail to support low-level compiler optimisations and this may lead to

inefficient streaming pipelines due to non-optimised stream transformations. The source code for this experiment is publicly available on GitHub [14].

Apache Flink treats user-defined functions as black boxes and is therefore not capable of optimising across functions. However, to improve performance, Flink groups higher-order functions into an operator chain where each function is called one by one throughout the chain. The chaining approach leads to improved

    performance compared to running the functions across threads (i.e. Flink tasks).

    Figure 3.1 depicts four different levels of fusion using a streaming pipeline that

    aims to perform an addition of 4 to the element it receives. (1) No Fusion. The

no-fusion level is a naïve version of the pipeline in which a user has written four map transformations one after another. (2) Task Fusion. This optimisation

    takes all four map transformations and puts them into an operator-chain that

    executes inside a single Flink task. (3) Invocation-level Fusion. The invocation

    fusion combines all x + 1 additions into a for-loop that is executed by a single

    function. The invocation-level fusion was added to simulate the task fusion in

order to understand the cost of invoking multiple function calls in the operator-

    chain. (4) Instruction-level Fusion. The strongest fusion level achievable is at the

    instruction-level where the stream transformations have been partially evaluated

    into a single instruction (i.e. x + 4).
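The four levels can be summarised with the following illustrative snippets (Rust is used for brevity; the experiment itself uses Flink's Scala API):

fn add_one(x: i32) -> i32 { x + 1 }

// (1) No fusion: four `add_one` operators, each in its own task (thread).
// (2) Task fusion: the same four operators invoked one by one inside a
//     single task.
fn task_fusion(x: i32) -> i32 {
    add_one(add_one(add_one(add_one(x))))
}

// (3) Invocation-level fusion: one function with a loop replaces four
//     separate operator invocations.
fn invocation_fusion(x: i32) -> i32 {
    (0..4).fold(x, |v, _| v + 1)
}

// (4) Instruction-level fusion: partially evaluated to a single instruction.
fn instruction_fusion(x: i32) -> i32 {
    x + 4
}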

Figure 3.1: Fusion Levels

    The executed experiment follows the same structure as figure 3.1. However, the

    map transformation is an addition of 50 instead of 4, which means that the no-

    fusion approach will consist of 50 successive Flink tasks.

Experimental Setup. The experiment was executed on an Ubuntu 18.04 machine with an Intel i7-4510U CPU @ 2.00GHz, using Apache Flink version 1.9.0 and Java OracleJDK 1.8.0_221.

    Results. Figure 3.2 depicts the outcome of the experiment. The slowest execution

    time (105 seconds) occurs when performing no fusion at all, that is, when disabling

    the operator chaining in Flink. If the chaining is enabled (Task Fusion), then the

    execution time goes down to 22 seconds, a 4.7x improvement. Instruction-level

    fusion (1.2 seconds) shows a 3x speed up over the invocation optimisation and

    18.3x over Task Fusion.

Figure 3.2: Apache Flink Fusion Experiment

Analysis. The results show that there is a cost to doing inter-operator calls within the Flink operator chain. Other than having to execute more instructions and switch environments for each operator, Flink, according to a design proposal [20], currently defaults to copying every element between operators in the chain. Typically, copying objects on the JVM is expensive, especially when dealing with more advanced data types such as JSON or Avro. However, the benchmark being discussed only deals with a trivial type (i.e., Int) that should be virtually free to copy. Observation: data scientists should not be required to have low-level system knowledge of the data analytics framework they are using. The compiler needs to be an integral part of the framework stack in order to guarantee that the most efficient code is executed.

3.2 Execution Goals

    The design of Arcon’s streaming runtime is motivated by the following goals:

    R1. The system must provide a high performance execution environment for

    the critical path of the data processing. The environment must also ensure the

    execution occurs without any form of disruption (e.g., garbage collection).

    R2. The system must support both dataflow and low-level compiler

    optimisations.

    R3. The system must adopt a flexible event-based middleware that seamlessly

    enables plugging in different scheduling implementations and network protocols

    (e.g., TCP, QUIC) based on the application requirements.

Rust and its LLVM backend were chosen in order to tackle R1. The JVM has for a

    long time been the de facto standard choice for data processing frameworks (e.g.,

Apache Spark, Apache Flink). However, it is far from perfect. JVM-based systems

    often execute slower than languages that compile down to efficient machine code

    (e.g., C/C++, Rust). Another limitation is that the critical path may be interrupted

momentarily due to garbage collection. It is important to note that although there are many upsides to moving away from the JVM, there are also drawbacks:

• Cloud Deployment. One of the advantages of building JVM-based applications is the ease of deployment in today's cloud environments. It is much easier to launch a JVM application compared to an equivalent Rust application. In Rust, one would need to ensure that the compiled binary matches the host architecture and that it is linked properly (i.e. if not statically linked).

• Dynamic Code Deployment. Being able to dynamically update streaming logic is an important feature for long-running pipelines. This operation becomes more complicated if application code is compiled into native machine code.

    Another important design goal (R2) is being able to support dataflow

    optimisations (e.g., fusion, fission), but also low-level compiler techniques such

    as partial evaluation and loop unrolling. To meet this requirement, Arcon’s

streaming runtime is heavily dependent on the Arc compiler to generate optimised

    code (e.g., user-defined-functions) for its batch backend. To achieve R3, the

    streaming runtime adopts Kompact, a hybrid concurrent component and actor

    model middleware. Kompact offers great flexibility (e.g., simplicity of adding

    new features) and provides out-of-the-box support for networking, scheduling,

    and timers.

4 Design

This chapter will cover the design of Arcon's streaming runtime. The first section will present the system model and the second section will give an overview of the

    architectural layout.

    4.1 System Model

The system is composed of the following elements: processes, local channels, and physical network channels. At a high level, the system can be defined as a set of processes

    that are connected by a finite number of physical network channels. Within each

    process there may exist a bounded number of channels for local message-passing.

    State. Stateful processing is under development and is therefore beyond the

    scope of this thesis.

Channels. Both local channels and physical network channels provide reliable FIFO-order delivery, where the latter achieve it through the use of TCP.

    Failure Model. A fail-stop [39] model is assumed, which means that if a single

    process crashes due to software or machine failure, it is no longer part of the

    system.

    4.2 System Overview

    The compiler pipeline in figure 4.1 presents the steps required to get from an Arc

    application down to an Arcon Executable. The dataflow graph in figure 4.2 will be

    used as an example in order to introduce the internals of the Arcon runtime. For

    simplicity, a local (i.e. no networking) single binary setup is assumed. The graph

    consists of two stream transformations, filter and map. Both transformations have

    a parallelism of two. Figure 4.3 depicts how the aforementioned dataflow graph

    is mapped down to runtime components.

Figure 4.1: Arcon Compiler Pipeline

Figure 4.2: Example Dataflow Graph

Figure 4.3: Execution Layout for Figure 4.2

The execution layout of the Arcon runtime shown in figure 4.3 consists of the following core elements:

    • Arcon Nodes (i.e. Kompact Components)

    • Kompact Scheduler

    • Channels

    Arcon Node. An Arcon Node (i.e. filter/map in figure 4.3) is a Kompact

    component that receives ArconMessages and produces ArconMessages as output.

An ArconMessage contains the following: (1) a sender ID that is used to identify channels; (2) a payload (e.g., a watermark or a stream element).

    Arcon Nodes are loaded with an operator function (i.e. user-defined-function) that

    is executed per received stream element. The operator function is generated by

    the Arc compiler and may produce a single output event or multiple in the case

    of a FlatMap function. Figure 4.4 shows the complete layout of a Node. The Node

    abstraction plays a central role in the design as a complete streaming pipeline is

    essentially a directed graph of Node components.
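A hypothetical shape of such a message is sketched below; the field and type names are illustrative rather than Arcon's exact definitions:

type NodeId = u32;

// The payload is either a stream element or a control message such as a
// watermark (see section 2.1).
enum Payload<T> {
    Element { data: T, timestamp: u64 },
    Watermark(u64),
}

struct ArconMessage<T> {
    sender: NodeId,      // identifies the incoming channel
    payload: Payload<T>,
}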

Figure 4.4: Arcon Node

    Kompact Scheduling. Arcon Nodes are driven by the Kompact scheduler. The

    scheduler is instantiated with a number of workers (i.e. threads) as seen in figure

4.3. Typically, the number of workers will equal the number of CPU cores on the machine in order to maximise parallelisation. Each worker will attempt to

    grab a scheduled component from its local job queue, however if the queue is

    empty, it will try to steal from other workers. Arcon Nodes are only scheduled to

    execute if they have work to accomplish (e.g., their message queue is non-empty).

    To ensure fairness between components, each schedule iteration only processes

    a pre-determined number of events (step 2 in figure 4.3) before asking to be

    scheduled again. Thus, by increasing this number, one increases the throughput

    of the component at the cost of potentially blocking other components from being

    scheduled.
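The stealing behaviour can be sketched with the crossbeam-deque crate (a simplified stand-in for Kompact's actual scheduler, with u64 job ids standing in for references to scheduled components):

use crossbeam_deque::{Steal, Stealer, Worker};

// Pop from the worker's local queue first; when it is empty, try to steal
// a scheduled component from a sibling worker.
fn next_job(local: &Worker<u64>, siblings: &[Stealer<u64>]) -> Option<u64> {
    local.pop().or_else(|| {
        siblings.iter().find_map(|s| loop {
            match s.steal() {
                Steal::Success(job) => return Some(job),
                Steal::Empty => return None,
                Steal::Retry => continue, // queue contended; try again
            }
        })
    })
}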

In contrast to Flink, Arcon at the runtime level does not take on a continuous long-running dataflow model, but rather a task-parallel approach. Multiple Arcon Nodes may be multiplexed over a single thread (i.e. worker). Figure 4.5 illustrates what this would look like using the example in figure 4.2, but with a parallelism of 4 for map and filter.

Figure 4.5: Multiplexed Arcon Nodes

    Channels. A channel in the Arcon runtime represents a connection to another

Arcon Node. Channels may either be local (ActorRef) or remote (ActorPath). A channel strategy refers to the choice of channel an Arcon Node decides to send on. Typically, the strategy will be a forward one where there is a one-to-one mapping between two Nodes. Other types of strategies are listed below, followed by a sketch of how a strategy might select channels:

• Broadcast → Sends to all channels

• Random-Shuffle → Selects a channel randomly based on a uniform distribution

• Round-Robin → Selects channels in a round-robin fashion

• KeyBy → Selects a channel through hash partitioning
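The sketch below illustrates how a strategy could map a record to one or more channel indices (hypothetical types; Random-Shuffle is omitted, as it would simply draw a uniformly random index, and the standard hasher stands in for whichever hash function is configured):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

enum ChannelStrategy {
    Forward,                    // one-to-one mapping to a single channel
    Broadcast,                  // send to every channel
    RoundRobin { next: usize }, // rotate through the channels
    KeyBy,                      // hash partitioning on the record key
}

fn select<K: Hash>(strategy: &mut ChannelStrategy, key: &K, channels: usize) -> Vec<usize> {
    match strategy {
        ChannelStrategy::Forward => vec![0],
        ChannelStrategy::Broadcast => (0..channels).collect(),
        ChannelStrategy::RoundRobin { next } => {
            let chosen = *next % channels;
            *next += 1;
            vec![chosen]
        }
        ChannelStrategy::KeyBy => {
            let mut hasher = DefaultHasher::new();
            key.hash(&mut hasher);
            vec![(hasher.finish() as usize) % channels]
        }
    }
}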

5 Results

    The experiment in this section aims to investigate the suitability of Arcon’s task-

    parallel approach to scheduling streaming operators. Stream processing systems

    such as Apache Flink use a dedicated thread model for streaming operators as it

    provides high performance. However, this approach is more restrictive as it limits

    the system to the number of cores on the machine. Arcon adopts a scheduling

    model that may multiplex multiple streaming operators over the same thread in

    order to gain more fine-grained parallelism.

    To make it easier to follow along, the streaming pipelines in this chapter will

    be presented using the Flink API rather than Arc IR. The repository for the

    benchmarks in this chapter is available on Github [31].

    5.1 Experimental Setup

    Table 5.1 shows the list of specifications related to the experiments in this thesis.

The list includes both information about the execution environment and the versions of the frameworks that were used.

Type                 Value
Memory               16G
JVM Heap Size        2048m
Physical CPU cores   4 (8 HT)
OS                   Arch Linux 5.3.7
Processor            Intel i7-4770 CPU @ 3.40GHz
Flink                1.9.0
Rust                 1.39.0-nightly
Java                 OracleJDK build 1.8.0_221

Table 5.1: Benchmark Specs

    5.2 Experiment Context

    Data-parallelism in streaming dataflows is typically achieved by partitioning

    the stream based on some key. Apache Flink uses the notion of key-groups

to split stream records into parallel downstream operators. The parallelism of the operator is usually less than or equal to the number of CPU cores available. It is rather hard to control or predict the distribution of the stream records entering the system. It may happen that a majority of the records belong to a certain key-group, thus overloading one parallel operator instance. Figure 5.1 shows an example with 4 key-groups (assuming a 4-core machine) with one of the partitions being overwhelmed.
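The key-group idea can be sketched as a two-step assignment (a hedged simplification; Flink's actual computation is defined in terms of a configured maximum parallelism):

// Step 1: a key is hashed into one of a fixed number of key-groups.
fn key_group(key_hash: u64, num_key_groups: u64) -> u64 {
    key_hash % num_key_groups
}

// Step 2: each key-group is mapped onto one of the parallel operator
// instances; with more key-groups than instances, load can be spread
// more evenly across them.
fn operator_index(group: u64, num_key_groups: u64, parallelism: u64) -> u64 {
    group * parallelism / num_key_groups
}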

Figure 5.1: Skewed Partitioning

The skewed partitioning in figure 5.1 leads to underutilisation of CPU cores, potentially slowing down the pipeline. One solution to

    the problem could be to add more key-groups in order to distribute the partitions

    more evenly. Figure 5.2 depicts how the workload could be distributed if 4 more

    key-groups are added.

Figure 5.2: Balanced Partitioning

    The key-group setup is now running 8 parallel operators using 4 CPU cores.

    Utilising more parallel operators than available CPU cores is typically not

    recommended while using a dedicated thread model due to context switching.

Arcon, on the contrary, is expected to perform better in such a setup, as its work-stealing scheduler is well suited to running more operators than cores.

Hypothesis. Arcon should outperform Flink given that a streaming pipeline

    is skewed and that the number of key-group partitions exceeds the number of

    CPU cores available.

    The objective of the experiment is to study how both scheduling models

(Arcon/Flink) behave under a data-parallel pipeline and to gain insight into the cost and potential benefits of Arcon's scheduling model.

    5.3 Benchmark Description

    The benchmark will revolve around a simple streaming pipeline consisting of the

    following operators: source, map, and sink. The source node partitions data to X

    number of map nodes. The partitioning occurs through the use of consistent hash

    partitioning (i.e. hash modulo parallelism). Both Arcon and Flink use the same

    Murmur3 [19] hash function in order to ensure that the partitions are equal in both

    systems. The pipeline code is depicted in listing 7 and its equivalent dataflow

    graph in figure 5.3.

    1 case class Item(id: Int, number: Long, iterations: Int)

    2 case class EnrichedItem(id: Int, root: Double)

    3 val env = StreamExecutionEnvironment.getExecutionEnvironment

    4 val source: DataStream[Item] = env.addSource(...)

    5 source.keyBy(_.id)

    6 .map(item => {

    7 val root = newtonSqrt(item.number, item.iterations)

    8 new EnrichedItem(item.id, root)

    9 }).addSink(...).setParallelism(1)

    10

    11 env.execute("Benchmark Pipeline")

Listing 7: Benchmark Pipeline

Figure 5.3: Data-parallel dataflow graph

Data Generation. As seen on line 5 in listing 7, the Item object is keyed on the id field. The data generator produces ids in the 1-50 range, with their distribution depending on which mode is used, that is, skewed or uniform. The

    second field of the Item object is a number between 1-100 that will be used for the

    workload function. Lastly, the iterations field is used to control the amount of

    work the function will do.
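A generator matching this description might look as follows (a hypothetical sketch using the rand crate; the exact skew used in the benchmark is defined in the benchmark repository [31]):

use rand::Rng;

// Produce ids in 1..=50; in skewed mode, half of the records are
// concentrated on a small set of hot keys.
fn next_id(rng: &mut impl Rng, skewed: bool) -> i32 {
    if skewed && rng.gen_bool(0.5) {
        rng.gen_range(1..=5)
    } else {
        rng.gen_range(1..=50)
    }
}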

Workload Function. For each received stream element, each Map operator executes the square root function depicted in listing 8. This particular workload was selected because it is fair (i.e. 64-bit floating-point operations) between Rust and the JVM, and because the amount of work is controllable via the number of iterations.

def newtonSqrt(square: Long, iters: Int): Double = {
  val target = square.toDouble
  var currentGuess = target
  for (_ <- 0 until iters) {
    currentGuess = (currentGuess + target / currentGuess) / 2
  }
  currentGuess
}

Listing 8: Newton square root function

Flink Execution Setup. The Flink setup, as illustrated in figure 5.4, remains the same during the benchmarks. The TaskManager contains 4 TaskSlots, therefore allowing a maximum parallelism of 4.

Figure 5.4: Flink Pipeline Setup

Arcon Execution Setup. In order to make the comparison with Flink fairer, both the source and sink components run on dedicated threads. The rest of the components

    (i.e. mappers) are scheduled on top of the work-stealing scheduler. Figure 5.5

    shows an example using a parallelism (p) of 8 and 4 scheduling threads (t).

    Kompact’s throughput setting was set to 50 for the benchmarks in order to ensure

    fairness between components. That is, each component processes a maximum of

    50 messages before being rescheduled.

Figure 5.5: Arcon Pipeline Setup

5.4 Results

    The results are measured by the overall execution time it takes for the sink to

    finish processing the 10 million stream elements that it is expecting. The final

    execution time is gathered by doing five runs and then taking the average. The

    baseline number of partitions is 4 as our experimental setup has 4 physical cores,

    thus allowing a maximum of 4 parallel operators in Flink. The benchmark was

    executed with two types of partition distributions. The first type was generated

    based on a uniform distribution with its partitions resulting in a (24.6%, 29.5%,

    21.3%, 24.6%) split. Whereas the second type was intentionally generated as

    skewed (51.3%, 19.7%, 6.6%, 22.4%). Arcon is executed with a range of different

    map parallelism/partitions (p) and scheduling threads (t), while Flink assumes a

(p=4, t=4) setup. Three different kinds of workloads were executed: low, medium, and high. The low workload ran with 100 Newton iterations, while medium and high ran with 500 and 1000, respectively.

Figure 5.6 shows the results of the benchmark pipeline when running with a uniform distribution, summarised as follows:

    Low Workload. Flink is 39.34% faster than Arcon’s baseline setup (p=4, t=4) and

    25.21% faster than the best performing Arcon setup (i.e. p=6, t=3).

    Medium Workload. Flink outperforms Arcon by 12.84% for the baseline setup

    and by 9.38% for the best run (p=12,t=4).

    High Workload. Arcon’s baseline is 10.21% slower than Flink during high

    workload and 5.72% slower when running with a p=8,t=4 setup.

Figure 5.6: Uniform data benchmark using three different workload scenarios. The plot shows execution time (seconds) per map operator workload (low, medium, high) for uniform data p4=(24.6%, 29.5%, 21.3%, 24.6%) and 10M elements, comparing Flink (p=4, t=4) with Arcon at p=4 t=4, p=4 t=2, p=6 t=3, p=8 t=4, and p=12 t=4.

The next graph (figure 5.7) depicts the benchmark outcome when running with a skewed data distribution. The results are summarised below:

Low Workload. Flink finishes 33.57% faster than the baseline setup (p=4,t=4) and 1.81% faster than Arcon's best performing setup (p=6,t=3).

    Medium Workload. Arcon is 18.24% slower with its baseline setup but 32.29%

    faster than Flink when running with p=12,t=4.

    High Workload. Flink is 13.28% faster than Arcon’s baseline, but 19.54% slower

    compared to Arcon (p=12,t=4).

Partition distribution. The distribution with a parallelism of 4 is depicted in figure 5.7, while the other distributions are as follows: p=6: (17.08%, 6.56%,

    32.89%, 5.26%, 17.11%, 21.07%). p=8: (31.58%, 5.25%, 5.24%, 19.73%, 10.53%,

    9.22%, 17.10%, 1.31%). p=12: (5.25%, 27.62%, 11.84%, 2.62%, 2.63%, 2.63%,

    5.26%, 1.30%, 5.27%, 11.85%, 9.22%, 14.46%).

Figure 5.7: Skewed data benchmark using three different workload scenarios. The plot shows execution time (seconds) per map operator workload (low, medium, high) for skewed data p4=(51.3%, 19.7%, 6.6%, 22.4%) and 10M elements, comparing Flink (p=4, t=4) with Arcon at p=4 t=4, p=4 t=2, p=6 t=3, p=8 t=4, and p=12 t=4.

5.5 Analysis

    The results show that there is a noticeable scheduling overhead for Arcon when

    the workload is low. Flink outperforms Arcon in the low workload scenario for

    both uniform and skewed data. This is likely due to the work-stealing scheduler

    not getting enough work and momentarily going to sleep. Seeing Arcon perform

    better with p=4,t=2 than p=4,t=4 helps support the aforementioned assumption.

    It should be noted that Arcon’s p=6,t=3 setup performs reasonably well in all runs

    despite using one less mapper thread than Flink.

    Another observation is that during medium and high workload scenarios, Arcon

    pays 5-10% in scheduling overhead while running with uniform partitions, but

    has the potential to gain up to 20-35% in performance when handling skewed

    data distributions. Arcon is able to tackle the skewed data problem by creating

    finer-grained partitions. Multiple partitions may be multiplexed over a single

    thread in order to ensure better utilisation and therefore performance. Arcon

does not benefit much from running with an equal assignment of components to threads (i.e. p = t), and achieves the best results with p ≥ 2t. Both p=8,t=4 and p=12,t=4 perform well in both data distributions.

The data supports the hypothesis laid out in section 5.2. Arcon surpasses Flink in performance under skewed data by creating finer-grained partitions and thus distributing the incoming data more evenly. In general, it is hard to outperform a dedicated thread model in raw performance as it continuously consumes and processes stream elements without any scheduling overhead.

However, as seen from the results, it is feasible to achieve comparable performance while also enabling a more flexible thread model, which opens up the possibility of sharing batch and stream computation using the same scheduler.

6 Conclusions

    Distributed stream processing systems ingest unbounded data and aim to

    guarantee low-latency processing while providing high-throughput capabilities.

    In addition, they commonly support the ability to scale out over hundreds of

    commodity machines. Although there has been a significant amount of research

    put into the field, there are still aspects that can be improved. Existing production-

    grade stream processors lack proper compiler integration and therefore miss

    out on notable low-level compiler optimisations (e.g., fusion of functions at the

    instruction-level). Another observation is that the commonly used dedicated

    thread model for streaming operators is not well suited for handling skewed

    pipelines. It may happen that one or two parallel operators get most of the work,

thus leaving the remaining threads (cores) underutilised. This thesis presented the design of a Rust-based stream processor that addresses the aforementioned

    limitations. The compiler integration occurs through Arc, a language for batch-

    and-stream programming that provides both dataflow and low-level compiler

optimisations. Chapter 3 included an analysis of how an existing stream processor (Apache Flink) could benefit from supporting more aggressive optimisations such as those provided by Arc. The analysis served as a motivation for the design laid out in chapter 4. Lastly, the work-stealing based scheduling model is evaluated against Apache Flink in chapter 5. The evaluation shows promising results for a proof-of-concept system compared to a state-of-the-art stream processor that

    has seen numerous optimisations over the years. The prototype system is able

    to outperform Apache Flink in skewed distributions while achieving comparable

    results during uniform data distributions.

    6.1 Future Work

There is plenty of interesting follow-up work to this thesis; the most notable directions are listed below:

• Low Workload Performance. The results in chapter 5 showed that Arcon had trouble competing with Flink in the low workload scenario. This needs to be investigated. It might just be that the workers of the scheduler are not getting enough work and therefore park their threads temporarily.

• Advanced Benchmarks. It would be interesting to run more complex benchmark pipelines that include parallel sources/sinks, distributed execution, and more systems, e.g., Structured Streaming.

• State Management. Arcon does not currently support stateful processing. A natural next step would be to integrate a state backend (e.g., RocksDB) and include state management in future evaluations.

Acronyms

    BSP Bulk Synchronous Parallel.

    DAG Directed Acyclic Graph.

    JVM Java Virtual Machine.


