DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019
Hardware Utilisation Techniques for Data Stream Processing
MAX MELDRUM
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Abstract
Recent years have seen an increase in use of the stream processing architecture to
compose continuous analytics applications. This thesis presents the design of a
Rust-based stream processor that adopts two separate techniques to tackle existing
weaknesses in modern production-grade stream processors. The first technique
employs a data analytics language on top of the streaming runtime, in order
to provide both dataflow and low-level compiler optimisations. This technique
is motivated by an analysis of the impact that the lack of compiler integration
may have on the end-to-end performance of streaming pipelines in Apache Flink.
In the second technique streaming operators are scheduled using a task-parallel
approach to boost performance for skewed data distributions. The experimental
results for data-parallel streaming pipelines in this thesis demonstrate that
the scheduling model of the prototype achieves performance improvements in
skewed scenarios without exhibiting any significant losses in performance during
uniform distributions.
Keywords
Data Analytics, Data Stream Processing, Distributed Systems
Abstract (Swedish)
In recent years, the use of the stream processing architecture to compose continuous
analytics applications has increased. This thesis presents the design of a Rust-based
stream processor that uses two separate techniques to address existing weaknesses in
modern stream processors. The first technique employs a data analytics language on
top of the stream processor in order to provide both dataflow and low-level compiler
optimisations. This technique is motivated by an analysis of the impact that the lack
of compiler integration may have on the end-to-end performance of analytics
applications in Apache Flink. In the second technique, streaming operators are
scheduled using a task-parallel approach to boost performance for skewed data
distributions. The experimental results for data-parallel analytics applications in this
thesis show that the scheduling model of the prototype achieves performance
improvements for skewed distributions without exhibiting any significant performance
losses under uniform distributions.
Acknowledgements
I would like to start by thanking Prof. Seif Haridi for giving me the opportunity to
work on the Continuous Deep Analytics (CDA) project. I have had the pleasure of
working under the supervision of excellent researchers during my thesis journey.
I would like to express my appreciation to my supervisors, Dr. Paris Carbone and
Lars Kroll, for their continuous guidance, support, and patience. I would also
like to thank Adam Hasselberg for helping me out with the implementation in
this thesis, but also for our interesting discussions in front of the whiteboard. In
addition, I would like to thank the rest of the CDA team for their feedback, this
includes Klas Segeljakt, Prof. Christian Schulte, and Dr. Theodore Vasiloudis.
Finally, I would like to thank my classmates, Nikhil and Óttar, for our bi-weekly
thesis support group.
Authors
Max Meldrum
Information and Communication Technology
KTH Royal Institute of Technology
Place for Project
Stockholm, Sweden
RISE SICS
Examiner
Prof. Seif Haridi
KTH Royal Institute of Technology
Supervisors
Lars Kroll
KTH Royal Institute of Technology
Dr. Paris Carbone
Research Institutes of Sweden
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Ethics and Sustainability . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Theoretical Background 10
2.1 Stream Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Event-based Middleware . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Motivation 17
3.1 Hardware Acceleration through Computation Sharing . . . . . . . . 17
3.2 Execution Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4 Design 22
4.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 Results 27
5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Experiment Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.3 Benchmark Description . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6 Conclusions 36
6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
References 38
1 Introduction
Batch processing systems [6, 18] have enabled powerful offline processing of data.
However, this class of systems was not designed for low-latency processing of
continuous incoming data. Therefore, a paradigm shift was imperative. Stream
processing systems ingest unbounded data and can process data in an online
fashion. Single node stream processors [16, 25, 34] focus on achieving high
performance on a single machine, whereas distributed stream processors (e.g.,
Apache Flink [11], Heron [28]) scale out over a cluster of commodity machines,
thus enabling fault tolerance and data-parallelism. Organisations use stream
processors to run never-ending services to drive their business-critical analytics.
Therefore, it is crucial that the systems are able to: (1) handle machine and
software failures; (2) execute efficiently on modern hardware; (3) adapt to volatile
workload skew.
Production grade distributed stream processors [8, 11, 28, 37] adhere to the
Dataflow/Beam model [2]. One common denominator of all these systems is that
they run on the Java Virtual Machine (JVM), which means inheriting limitations
such as runtime overhead due to garbage collection and failing to fully exploit
modern hardware as observed by Zeuch et al. [45]. With the growing amount
of data being produced daily, existing and future stream processors must be able
to process data at even greater speeds while still being resource-efficient. With
that serving as motivation, this work aims to investigate whether the following
techniques provide sufficient improvements over the state of the art:
Adoption of a more flexible thread model. The aforementioned systems typically
use a dedicated thread-per-streaming-operator model. Although this allows for
high performance, it is also restrictive, as it ties the achievable parallelism to the
number of cores on the machine. Higher levels of parallelism are only
possible through a more fine-grained scheduling model. To that end, the aim is to
investigate if it is possible to achieve better hardware utilisation while scheduling
streaming operators on a work-stealing threadpool in which they would only be
active while having work.
Using a data analytics language. A shortcoming of existing distributed stream
processing systems is the inability to provide both dataflow and ahead-of-time
compiler optimisations. Supporting the latter is not possible while treating user-
defined functions in the framework as black boxes. To overcome this problem, an
intermediate representation (IR) can be utilised on top of a stream processor. The
IR would make it possible to optimise streaming dataflows more aggressively.
However, this comes at the cost of making the system more restrictive, i.e.
limiting what user code may be accepted.
These techniques have been incorporated into a distributed data stream processor
written in Rust and are in this thesis evaluated against Apache Flink, a state-of-
the-art distributed stream processor.
1.1 Background
This section starts by describing the dataflow processing abstraction that is used in
the majority of data processing systems. This is followed by a brief introduction of
the Rust programming language together with a motivation of why the language
is suited for implementing a high performance data stream runtime. Finally, the
last section will introduce the Arcon system into which the streaming runtime in
this thesis is integrated.
1.1.1 Dataflow Processing
Distributed data processing frameworks [11, 41, 43] express applications as a
Directed Acyclic Graph (DAG): G = (V, E), where vertices (V) represent the
computational tasks and edges (E) the data channels between them. Dataflow
graphs provide the necessary processing abstractions needed in both stream
and batch processing. Figure 1.1 depicts a streaming dataflow, where each task
manipulates incoming records and pushes the output along the graph. Source
and Sink tasks are a special form of task. The former acts as an entry point to the
dataflow graph, while the latter is responsible for taking the transformed data and
publishing it to an external endpoint (e.g., database). Normal tasks have both an
input and an output channel, whereas sources have only an output channel
and sinks only an input channel.
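To make the abstraction concrete, the following is a minimal Rust sketch of the G = (V, E) structure described above; the type names and the example topology are illustrative and are not taken from any particular system.

// Minimal sketch of the G = (V, E) dataflow abstraction; names are illustrative.
enum Task {
    Source,                  // entry point: only outgoing channels
    Operator(&'static str),  // e.g. "filter" or "map"
    Sink,                    // publishes results: only incoming channels
}

struct DataflowGraph {
    tasks: Vec<Task>,           // V
    edges: Vec<(usize, usize)>, // E: (producer index, consumer index)
}

fn main() {
    // A small example topology: two sources feeding two filters into one sink.
    let graph = DataflowGraph {
        tasks: vec![Task::Source, Task::Source,
                    Task::Operator("filter"), Task::Operator("filter"),
                    Task::Sink],
        edges: vec![(0, 2), (1, 3), (2, 4), (3, 4)],
    };
    println!("{} tasks, {} channels", graph.tasks.len(), graph.edges.len());
}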
Figure 1.1: Streaming Dataflow
1.1.2 Rust
Rust is a systems programming language that primarily focuses on safety and
performance. Ownership semantics are incorporated into the language which
enables it to guarantee memory and thread safety at compile time. The ownership
model solves the issue of heap memory management. This means Rust does
not suffer from any runtime overhead due to garbage collection. Since it runs
on bare metal and is built on LLVM, Rust is able to deliver runtime
performance similar to that of C/C++. The qualities above make Rust a suitable choice for a
stream processor. Streaming applications may run for weeks, months, or even
years. Thus, longer compilation times are affordable if they result in highly
efficient application binaries.
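As a small, generic illustration of the ownership model (not code from the Arcon implementation), heap-allocated values are freed deterministically when their owner goes out of scope, so no garbage collector is needed:

fn sum_and_consume(values: Vec<i64>) -> i64 {
    // This function takes ownership of `values`; the heap allocation is
    // freed deterministically when `values` goes out of scope here.
    values.iter().sum()
}

fn main() {
    let records = vec![1, 2, 3];
    let total = sum_and_consume(records);
    // println!("{:?}", records); // rejected at compile time: `records` was moved
    println!("sum = {}", total);
}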
1.1.3 The Arcon System
Modern data processing frameworks are typically optimised for a specific
workload. Tensorflow [30] is tailored for machine learning programs. Ray
[35] targets AI applications in general and in particular Reinforcement Learning.
Spark [44] excels at relational operations and batch processing at scale. Flink [11]
is a distributed stream processor predominantly used for large-scale real-time
analytics. Giraph [21] is a graph processing framework that is exclusively utilised
for analysing massive graphs. The drawbacks of specialised systems become
apparent when one needs to combine many diverse workloads in an end-to-end
data analytics pipeline. Not only does the complexity increase (e.g., because each
framework has different processing guarantees), but it becomes harder to perform
optimisations across frameworks. Lastly, there are performance penalties due to
the data materialisation that occurs between frameworks. The Arcon [33] project
aims to tackle this problem and design a new generation of data programming
system for building continuous analytics applications. Arcon is made up of two
parts: 1) Arc [27], a language for batch-and-stream programming; 2) the Arcon
runtime, a distributed dataflow runtime based on stream processing that executes
applications constructed through Arc.
Figure 1.2 depicts the Arcon stack. At the top are examples of workloads that
Arcon aims to support, either through Arcon’s own language frontend or by
providing direct translations of libraries (e.g., Tensorflow [30], Beam [2], Pandas
[32]) to Arc.
[Figure: the Arcon stack, with workloads such as Event-Based Microservices, Stream ML, SQL, and Dynamic Graphs on top of the Arc IR/Compiler, which runs on the Arcon Runtime, split into an Operational Plane and an Execution Plane]
Figure 1.2: System Overview
1.1.3.1 The Arc IR
Arc extends the Weld [38] language, which targets batch
computation, to also include streaming semantics. The main motivation behind
Weld is that modern data analytic workflows tend to combine functions from
multiple libraries. This usually results in an inefficient execution due to isolated
execution and lack of optimisation across libraries. Weld tackles this by lazily
building up the computation for the full execution plan and evaluating the result
only when required. In addition to existing features in Weld, Arc adds constructs
that enable the language to form streaming pipelines composed of operators
connected by channels. Arc seamlessly enables the end user to take advantage
of both batch-and-stream processing in the same application. During Arc’s
compilation process, a plethora of dataflow optimisations [24] may be applied.
Listing 1 shows a simple Arc function that takes two signed integers and adds
them together.
1 |a: i32, b: i32| a + b
Listing 1: Addition function
Arc Data Types. There are two categories of types in Arc, value types and builder
types. Value types are read-only and are meant to hold data. Down below are
some examples of value types:
• Scalars: bool, i8, u8, i16, u16, i32, u32, i64, u64, f32, f64
• Vectors: Vec[T]
• Structs: {T1, T2, ...}
• Dictionaries: Dict[K, V]
• Streams: Stream[T]
Builder types are write-only types that are used for parallel constructs. Elements
can be merged into a builder by using the merge operation. It may also be
materialised into a value type by calling result on it. Down below are some
examples of commonly used Arc builders and their respective output value
types:
• Appender[T]: Vec[T]
• Merger[T, binop]: T
• StreamAppender[T]: Stream[T]
1.1.3.2 Examples in Arc
Listing 2 shows an example of the Appender[T] builder.
It is used to append values to a vector of type T. On line 4, the appender is
materialised into a Vec[i32].
1 let app = Appender[i32];
2 let app = merge(app, 10);
3 let app = merge(app, 20);
4 result(app) // [10, 20]
Listing 2: Appender Builder
The Merger[T, binop] builder takes a type T and a commutative binary operation
(e.g., +, *). Listing 3 showcases an example using multiplication. On line 1, the
merger is initialised with the value 1, whereas if it had been an addition merger
(i.e. Merger[T, +]), it would have been set to 0.
1 let m = Merger[i32, *];
2 let m = merge(m, 3);
3 let m = merge(m, 5);
4 result(m) // 15
Listing 3: Merger Builder
Listing 4 depicts a bounded (i.e. batch) map function in Arc. The function takes
a Vec[i32] which it then iterates over. For each iteration, it merges the current
element + 3 into an Appender builder. At the end of the for-loop, the appender is
returned and then materialised back into a Vec[i32] through the result operation.
1 |data: Vec[i32]| result(for(data, Appender[i32],
2 | app, element, index | merge(app, element + 3)))
Listing 4: Arc Map operation
This next example in listing 5 shows a simple streaming pipeline using Arc. The
function accepts Stream[u32] and StreamAppender[u32] as arguments. The for-loop
on line 2 represents a streaming operator that receives inputs from a queue. Each
element is multiplied by 10 and then merged onto the outgoing channel by using
the StreamAppender builder. Figure 1.3 depicts the dataflow graph of this small
piece of Arc code.
1 |source: Stream[u32], sink: StreamAppender[u32]|
2 for (source, sink, | sink, element |
3 merge(sink, element * 10))
Listing 5: Arc Example
Figure 1.3: Dataflow graph of listing 5 (Source → Map → Sink)
Although the streaming pipeline in listing 5 is trivial, a range of advanced
pipelines may be expressed using the Arc IR.
1.1.3.3 Arcon Runtime
At the heart of the Arcon system is the streaming-based
distributed runtime. The runtime is split into two areas of concern: Operability
and Execution.
The Operational Plane: The operational plane orchestrates the deployment of
applications on top of Arcon. It is responsible for coordinating the distributed
execution at a high-level. This includes application monitoring and management
of state. The operational plane is implemented in Scala using the actor framework
Akka [3]. This part of the runtime does not perform any heavy computation.
Therefore, Scala is chosen, not for its performance, but for its expressiveness and
interoperability with the rich ecosystem of the JVM (e.g., HDFS [40], YARN [42]).
The Execution Plane: The execution plane contains the core execution engine
of the system. It is responsible for the critical path of the data processing.
Streaming dataflows are typically deployed as static task graphs, which means
that computations (e.g., iterative, recursive) on top of these dataflow engines tend
to be static as well. The limitations of this approach are twofold. First, performance-
heavy algorithms may contend for resources with the main streaming logic.
Secondly, since the execution plan is static, it cannot adapt if the
available infrastructure changes (e.g., new machines are added). Arcon aims
to combine the long-running dataflow model for continuous processing with
a dynamic task model, where the latter focuses on batch computations that
can be dynamically scheduled during execution.
1.2 Contributions
This degree project provides three core contributions: two techniques for
improving hardware utilisation in data stream processors and an architectural
overview of Arcon’s streaming runtime.
T1. The first technique employs a data
analytics IR in order to provide both dataflow and instruction-level optimisations.
The thesis includes an analysis of how existing production-grade distributed
stream processors could enhance the performance of end-to-end pipelines by
adopting T1.
T2. In the second technique, streaming operators are scheduled using a work-
stealing threadpool in contrast to the conventional dedicated thread model. T2
is evaluated under a data-parallel pipeline to study how it fares against a state-
of-the-art (Apache Flink) stream processor that is using the dedicated thread
approach.
1.3 Methodology
The degree project uses a descriptive method to review the current state of
the art in distributed stream processing, and an experimental approach to perform
a quantitative performance analysis on the two proposed hardware utilisation
techniques.
1.4 Ethics and Sustainability
The system described in this thesis is a research prototype. Therefore, there
are no proper security measures, such as data encryption, in place. Data analytics
frameworks may be used with malicious intent and while implementing
encryption mechanisms for the system is straightforward, it is not feasible to
control what its users do. Sustainability is another important aspect when
designing new systems; thus, the algorithms that are applied should be energy
efficient.
1.5 Delimitations
The ecosystem of the Rust programming language is still evolving. For that
reason, the implementation in this thesis is limited to a smaller set of stable
libraries. However, this is not an issue as the system in this degree project is still
a research prototype.
2 Theoretical Background
In this chapter, the necessary background knowledge required for the remainder of the
thesis is covered.
2.1 Stream Processing
Stream processors are typically categorised into either a continuous or micro-
batch model. Continuous operator based streaming systems [11, 36, 41] express
applications as a DAG consisting of long-running operators, where records are
processed as they arrive. Fault-tolerance in this particular model is commonly
achieved by utilising a checkpointing mechanism [12, 17]. This streaming model
enables low-latency processing, although, at the cost of expensive fault recovery
when the state is large. Micro-batch based streaming systems [15, 22, 43] use the
same DAG abstraction, but gather streaming data into micro-batches, which are
then processed using the Bulk Synchronous Parallel (BSP) model. Computations
in this model are deterministic, which allows for efficient fault recovery through
lineage tracking and techniques like parallel recovery [43]. Although the model
is able to achieve high throughput, it suffers from downsides such as high latency
due to task scheduling and the need for a centralised driver program. Throughout
the rest of this report, the continuous operator model will be assumed.
Event-time vs. Progress time. Stream processors typically support different
notions of time. Event-time refers to the time at which an event was created, e.g.,
the time an external device produced a record, whereas progress time refers to
the system time of the current machine.
Watermarks. Events may be ingested in parallel from multiple sources. As
a result, the events may arrive out-of-order. The watermark notion is used to
measure progress in a stream processor that utilises event-time. A watermark
is a message that flows through the dataflow graph. It contains a timestamp t,
which indicates the lowest timestamp in the system: once a watermark with
timestamp t has been received, there should no longer be any elements in the
stream with an event-time lower than t.
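The following is a minimal, generic sketch (not taken from any of the cited systems) of how an operator with several input channels could track watermarks: it records the last watermark per input and only emits a new output watermark when the minimum across all inputs advances.

struct WatermarkTracker {
    per_input: Vec<u64>, // last watermark seen on each input channel
}

impl WatermarkTracker {
    fn new(inputs: usize) -> Self {
        WatermarkTracker { per_input: vec![0; inputs] }
    }

    /// Record a watermark from one input channel. Returns a new output
    /// watermark only if the minimum across all inputs advanced.
    fn on_watermark(&mut self, input: usize, t: u64) -> Option<u64> {
        let old_min = *self.per_input.iter().min().unwrap();
        if t > self.per_input[input] {
            self.per_input[input] = t;
        }
        let new_min = *self.per_input.iter().min().unwrap();
        if new_min > old_min { Some(new_min) } else { None }
    }
}

fn main() {
    let mut wm = WatermarkTracker::new(2);
    assert_eq!(wm.on_watermark(0, 10), None);   // the other input is still at 0
    assert_eq!(wm.on_watermark(1, 7), Some(7)); // minimum advanced to 7
}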
2.1.1 Apache Flink
Apache Flink [11] is a distributed stream processing system which adheres to
the continuous operator model. Flink runs on the JVM (Java Virtual Machine)
and unifies batch processing and stream processing under a single engine. The
unification is achieved by adapting the Dataflow model [2], in which batch data
is treated as a finite stream. Flink handles out-of-order processing [29], provides
exactly-once semantics [10, 13], and integrates seamlessly with existing cluster
managers such as YARN [42] and Kubernetes [23].
Flink Application: Flink uses a declarative API enabling users to define their
applications in a concise way. Listing 6 showcases a simple Flink program using
the programming language Scala. The stream object on line 2 represents an
unbounded stream of integers. In this example, two common stream transformations are
applied. First, negative numbers are filtered away. Then, each positive integer is
multiplied by 5. Finally, the print method on line 5 instructs Flink to print out the
result. A stream transformation occurs when a stream operator transforms the
current stream into a new one, with potentially a new data type.
1 val env = StreamExecutionEnvironment.getExecutionEnvironment
2 val stream: DataStream[Int] = env.addSource(...)
3 stream.filter(i => i > 0)
4 .map(x => x * 5)
5 .print()
6
7 env.execute("Flink Example")
Listing 6: Flink Application
The above application is turned into a dataflow graph as seen in figure 2.1.
User-Defined-Functions: Flink supports a variety of different stream
transformations [5]. Flink also supports creating custom UDFs in case the
provided transformations are not sufficient. Each transformation, or so-called
operator, executes inside a Flink task. Each task runs in its own thread.
Figure 2.1: Dataflow graph of listing 6
Figure 2.2: Optimised dataflow graph with an operator chain
It is possible in Flink to fuse operators into a single task, in which each operator
becomes a subtask of an operator chain. Figure 2.2 shows an optimised version
of figure 2.1. The filter and map operators are now chained inside one task, thus
reducing the overhead of running them across different threads.
Thread Model: Flink runs a dedicated thread for each task. In short, each task
runs an infinite while loop, where it tries to consume messages from upstream
tasks. This approach enables high performance, but at the cost of idle resources if
the task is not continuously fed data.
Memory Management: Flink takes explicit control of memory management. It
does so by pre-allocating memory segments into which it serialises objects. In
order to avoid the cost of serialisation and deserialisation, Flink tries to operate on
binary data as much as possible. As the data is in bytes, spilling to disk becomes
trivial. Using this approach, Flink is able to avoid OutOfMemory errors that
would shut down the JVM. Also, the pressure on the garbage collector is reduced as
Flink avoids putting large objects on the heap.
2.2 Event-based Middleware
Kompact [26] is an event-based middleware that combines the Component
Model of Kompics [7] together with the Actor Model [1]. An introduction to the
aforementioned models will first be given before delving further into the Kompact
framework.
2.2.1 Actor Model
The Actor Model is a computational model based on asynchronous message-
passing. Actors encapsulate their own state and interact with other actors by
sending messages. Messages are stored in a mailbox that is tied to each actor.
Actors process messages from the mailbox in a FIFO order. Figure 2.3 showcases
a simple scenario where Actor A sends messages to Actor B’s mailbox.
Figure 2.3: Actors
In response to a message, an actor may take the following actions:
• send messages to other actors
• send a response
• change its state
• create more actors
• terminate itself
By using the actor model, the application complexity decreases as there is no need
for explicit thread synchronisation (e.g., Mutex/Locks). There are many notable
actor systems [4, 9] out there today, many of which are being used in production
in large companies.
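As a rough illustration of the model (plain Rust with a channel as the mailbox, not the API of any of the cited actor systems), an actor owns its private state and reacts to messages received in FIFO order:

use std::sync::mpsc;
use std::thread;

enum Msg {
    Add(i64), // mutate the actor's state
    Stop,     // ask the actor to terminate itself
}

fn main() {
    // The channel acts as the actor's FIFO mailbox.
    let (mailbox, inbox) = mpsc::channel();

    let actor = thread::spawn(move || {
        let mut state = 0i64; // state is private to the actor
        for msg in inbox {    // messages are processed in FIFO order
            match msg {
                Msg::Add(n) => state += n,
                Msg::Stop => break,
            }
        }
        state
    });

    mailbox.send(Msg::Add(5)).unwrap();
    mailbox.send(Msg::Add(7)).unwrap();
    mailbox.send(Msg::Stop).unwrap();
    println!("final state = {}", actor.join().unwrap()); // prints 12
}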
2.2.2 Component Model
Similar to the Actor model, the component model is an event-based asynchronous
message-passing model that is suited for building distributed systems. The model
consists of a set of components that execute concurrently and interact through
statically-typed ports that are connected by FIFO ordered channels.
Figure 2.4: Component Model
Component. A component has its own internal state which can only be accessed
through interaction with the component. This interaction occurs through the Port
interface.
Ports. A port supports bidirectional communication. Ports use the notion of
Request and Indication to express the direction of events. Components either
provide or require a port. As seen in figure 2.5, Request represents an incoming
event on this port. Thus, a component that utilises this port must implement
Provide, in which it specifies how the message is to be handled, whereas a Require
implementation is needed for the Indication of the port.
Figure 2.5: Port (Request and Indication directions)
2.2.3 Kompact
Kompact empowers system designers with the option of composing applications
using the component model, actor model, or a combination of them both. For
applications that need to spawn workers during execution, using the actor model
is then an excellent approach as it is more suitable for a dynamic environment.
Whereas if the application setup is predefined, using the component model may
be a more fitting choice. Depending on the application, it may also make sense to
combine the two models.
Kompact System. The Kompact System provides an execution runtime for
components. This includes scheduling, timers, and networking. The internal
scheduler aims to ensure fairness between the components running on the
Kompact System. Components are typically scheduled on top of a threadpool,
but it is also possible to plug in custom schedulers. It is also possible to tweak
other settings such as throughput, message priority, and number of threads used
by the scheduler.
Kompact Component. A Kompact component will always execute sequentially
in a single-threaded fashion and may be scheduled onto different threads over its
lifetime. Figure 2.6 depicts how Kompact components may be used to structure
applications.
Kompact Scheduling. Whenever a Kompact component is scheduled, the inner
execute function of the component will attempt to process events in the following
order: (1) Control Events (e.g., Stop/Kill Component); (2) Timers; (3) Messages
(i.e. user-defined messages).
Figure 2.6: Kompact Overview
Kompact Networking. Components in Kompact may communicate over the
network by using the ActorPath object. Kompact currently only supports the TCP
protocol; however, it is possible to extend it to support others (e.g., UDT/QUIC).
All connections to a remote Kompact System are multiplexed over a single
TCP channel. The underlying network implementation utilises the Tokio
library (https://tokio.rs), which is the most popular network framework for systems
in the Rust ecosystem.
3 Motivation
This chapter is split into two sections. The first section will demonstrate through
a contrived experiment the need for stronger compiler integration in data stream
processing systems. The experiment aims to serve as a motivation for the
execution goals described in the ensuing section.
3.1 Hardware Acceleration through Computation Sharing
We have observed that common JVM-based stream processors such as Apache
Flink fail to support low-level compiler optimisations, and this may lead to
inefficient streaming pipelines due to non-optimised stream transformations. The
source code for this experiment is publicly available on Github [14].
Apache Flink treats user-defined functions as black boxes and is therefore not
capable of optimising across functions. However, to improve performance, Flink
groups higher-order functions into an operator chain where each function is called
one by one throughout the chain. The chaining approach leads to improved
performance compared to running the functions across threads (i.e. Flink tasks).
Figure 3.1 depicts four different levels of fusion using a streaming pipeline that
aims to perform an addition of 4 to the element it receives. (1) No Fusion. The
no-fusion level is a naïve version of the pipeline in which a user has written
four map transformations one after another. (2) Task Fusion. This optimisation
takes all four map transformations and puts them into an operator-chain that
executes inside a single Flink task. (3) Invocation-level Fusion. The invocation
fusion combines all x + 1 additions into a for-loop that is executed by a single
function. The invocation-level fusion was added to simulate task fusion in
order to understand the cost of invoking multiple function calls in the operator-
chain. (4) Instruction-level Fusion. The strongest fusion level achievable is at the
instruction-level where the stream transformations have been partially evaluated
into a single instruction (i.e. x + 4).
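The four levels can be sketched with plain closures (illustrative Rust, not Flink or Arc code): no fusion and task fusion correspond to invoking each +1 map separately, invocation-level fusion folds the additions into a single loop, and instruction-level fusion partially evaluates the chain into a single x + 4.

fn main() {
    let input: Vec<i64> = (0..10).collect();

    // (1)/(2) No fusion / task fusion: the same "+ 1" map applied four times,
    // each invocation being a separate operator call.
    let inc: fn(i64) -> i64 = |x| x + 1;
    let maps = [inc, inc, inc, inc];
    let chained: Vec<i64> = input.iter()
        .map(|&x| maps.iter().fold(x, |acc, f| f(acc)))
        .collect();

    // (3) Invocation-level fusion: a single function looping over the additions.
    let invocation: Vec<i64> = input.iter()
        .map(|&x| (0..4).fold(x, |acc, _| acc + 1))
        .collect();

    // (4) Instruction-level fusion: partially evaluated into a single instruction.
    let fused: Vec<i64> = input.iter().map(|&x| x + 4).collect();

    assert_eq!(chained, invocation);
    assert_eq!(invocation, fused);
}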
Figure 3.1: Fusion Levels
The executed experiment follows the same structure as figure 3.1. However, the
pipeline performs a total addition of 50 instead of 4, which means that the no-
fusion approach consists of 50 successive Flink tasks.
Experimental Setup. The experiment was executed on a Ubuntu 18.04 machine
with Intel i7-4510U CPU @ 2.00GHz processor using Apache Flink version 1.9.0
and Java OracleJDK 1.8.0_221.
Results. Figure 3.2 depicts the outcome of the experiment. The slowest execution
time (105 seconds) occurs when performing no fusion at all, that is, when disabling
the operator chaining in Flink. If the chaining is enabled (Task Fusion), then the
execution time goes down to 22 seconds, a 4.7x improvement. Instruction-level
fusion (1.2 seconds) shows a 3x speed up over the invocation optimisation and
18.3x over Task Fusion.
Figure 3.2: Apache Flink Fusion Experiment
Analysis. The results show that there is a cost to the inter-operator calls within the
Flink operator chain. Other than having to execute more instructions and switch
environments for each operator, Flink, according to a design proposal [20], currently
defaults to copying every element between operators in the chain. Typically, copying
objects on the JVM is expensive, especially when dealing with more advanced
data types such as JSON or Avro. However, the benchmark being discussed only
deals with a trivial type (i.e. Int) that should be virtually free to copy. Observation:
data scientists should not be required to have low-level system knowledge of the
data analytics framework they are using. The compiler needs to be an integral part
of the framework stack in order to guarantee that the most efficient code possible is
executed.
3.2 Execution Goals
The design of Arcon’s streaming runtime is motivated by the following goals:
R1. The system must provide a high performance execution environment for
the critical path of the data processing. The environment must also ensure the
execution occurs without any form of disruption (e.g., garbage collection).
R2. The system must support both dataflow and low-level compiler
optimisations.
R3. The system must adopt a flexible event-based middleware that seamlessly
enables plugging in different scheduling implementations and network protocols
(e.g., TCP, QUIC) based on the application requirements.
Rust and its LLVM backend were chosen in order to tackle R1. The JVM has for a
long time been the de facto standard choice for data processing frameworks (e.g.,
Apache Spark, Apache Flink). However, it is far from perfect. JVM based systems
often execute slower than languages that compile down to efficient machine code
(e.g., C/C++, Rust). Another limitation is that the critical path may be interrupted
momentarily due to garbage collection. It is important to note that although there
are many upsides to moving away from the JVM, there are also drawbacks:
• Cloud Deployment. One of the advantages of building JVM-based applications is the ease of deployment in today’s cloud environments. It
is much easier to launch a JVM application compared to an equivalent Rust
application. In Rust, one would need to ensure that the compiled binary
matches the host architecture and that it is linked properly (i.e. if not statically
linked).
• Dynamic Code Deployment. Being able to dynamically update streaming logic is an important feature for long-running pipelines. This operation
becomes more complicated if you have application code that is compiled
into native machine code.
Another important design goal (R2) is to support not only dataflow
optimisations (e.g., fusion, fission) but also low-level compiler techniques such
as partial evaluation and loop unrolling. To meet this requirement, Arcon’s
streaming runtime is heavily dependent on the Arc compiler to generate optimised
code (e.g., user-defined-functions) for its batch backend. To achieve R3, the
streaming runtime adopts Kompact, a hybrid concurrent component and actor
model middleware. Kompact offers great flexibility (e.g., simplicity of adding
new features) and provides out-of-the-box support for networking, scheduling,
and timers.
4 Design
This chapter will cover the design of Arcon’s streaming runtime. The first section
will present the system model and the second section will give an overview of the
architectural layout.
4.1 System Model
The system is composed of the following elements: processes, local channels, and physical
network channels. At a high level, the system can be defined as a set of processes
that are connected by a finite number of physical network channels. Within each
process there may exist a bounded number of channels for local message-passing.
State. Stateful processing is under development and is therefore beyond the
scope of this thesis.
Channels. Both local channels and physical network channels assume reliable
FIFO delivery, where the latter achieves it through the use of TCP.
Failure Model. A fail-stop [39] model is assumed, which means that if a single
process crashes due to software or machine failure, it is no longer part of the
system.
4.2 System Overview
The compiler pipeline in figure 4.1 presents the steps required to get from an Arc
application down to an Arcon Executable. The dataflow graph in figure 4.2 will be
used as an example in order to introduce the internals of the Arcon runtime. For
simplicity, a local (i.e. no networking) single binary setup is assumed. The graph
consists of two stream transformations, filter and map. Both transformations have
a parallelism of two. Figure 4.3 depicts how the aforementioned dataflow graph
is mapped down to runtime components.
Figure 4.1: Arcon Compiler Pipeline
Figure 4.2: Example Dataflow Graph
Figure 4.3: Execution Layout for Figure 4.2
The execution layout of the Arcon runtime shown in figure 4.3 consists of the
following core elements:
• Arcon Nodes (i.e. Kompact Components)
• Kompact Scheduler
• Channels
Arcon Node. An Arcon Node (i.e. filter/map in figure 4.3) is a Kompact
component that receives ArconMessages and produces ArconMessages as output.
An ArconMessage contains the following: (1) Sender ID that is used to identify
channels; (2) a payload (e.g., watermark or stream element).
Arcon Nodes are loaded with an operator function (i.e. user-defined-function) that
is executed per received stream element. The operator function is generated by
the Arc compiler and may produce a single output event or multiple in the case
of a FlatMap function. Figure 4.4 shows the complete layout of a Node. The Node
abstraction plays a central role in the design as a complete streaming pipeline is
essentially a directed graph of Node components.
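A condensed sketch of these concepts is shown below; only the sender ID and the watermark/element payload are described in the text, so the exact field layout and the example map operator are assumptions made purely for illustration.

// Illustrative sketch only; the real runtime's message layout is not shown in
// this thesis, so field names and types here are assumptions.
struct NodeId(u32);

enum Payload {
    Element { value: i64 },
    Watermark(u64),
}

struct ArconMessage {
    sender: NodeId, // used to identify the incoming channel
    payload: Payload,
}

// Conceptually, the operator function loaded into a Node turns one input
// element into zero or more outputs (e.g. several in the case of a FlatMap).
fn map_operator(input: i64) -> Vec<i64> {
    vec![input * 10]
}

fn main() {
    let msg = ArconMessage { sender: NodeId(7), payload: Payload::Element { value: 4 } };
    if let Payload::Element { value } = msg.payload {
        println!("node {} produced {:?}", msg.sender.0, map_operator(value));
    }
}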
Figure 4.4: Arcon Node
Kompact Scheduling. Arcon Nodes are driven by the Kompact scheduler. The
scheduler is instantiated with a number of workers (i.e. threads) as seen in figure
4.3. Typically, the number of workers equals the number of CPU cores on the
machine in order to maximise parallelism. Each worker will attempt to
grab a scheduled component from its local job queue; however, if the queue is
empty, it will try to steal from other workers. Arcon Nodes are only scheduled to
execute if they have work to accomplish (e.g., their message queue is non-empty).
To ensure fairness between components, each schedule iteration only processes
a pre-determined number of events (step 2 in figure 4.3) before asking to be
scheduled again. Thus, by increasing this number, one increases the throughput
of the component at the cost of potentially blocking other components from being
scheduled.
In contrast to Flink, Arcon, at the runtime level, does not take on a continuous
long-running dataflow model, but rather a task-parallel approach. Multiple Arcon
Nodes may be multiplexed over a single thread (i.e. worker). Figure 4.5 illustrates
what this would look like using the example in figure 4.2 but with a parallelism of
4 for map and filter.
Figure 4.5: Multiplexed Arcon Nodes
Channels. A channel in the Arcon runtime represents a connection to another
Arcon Node. Channels may either be local (ActorRef) or remote (ActorPath). A
channel strategy refers to the choice of channel an Arcon Node decides to send on.
Typically, the strategy will be a forward one where there is a one-to-one mapping
between two Nodes. Other types of strategies are listed down below:
• Broadcast → Send to all channels
• Random-Shuffle → Selects a channel randomly based on a uniform distribution
• Round-Robin → Selects channels in a round-robin fashion
• KeyBy → Selects a channel through hash partitioning (a sketch follows below)
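A condensed sketch of how such channel strategies could be represented, including the hash-based key partitioning used in the benchmark chapter, is given below (illustrative Rust; the actual Arcon types are not shown in this thesis, and a standard-library hasher stands in for the Murmur3 function used in the experiments).

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[allow(dead_code)]
enum ChannelStrategy {
    Forward,       // one-to-one mapping to a single downstream Node
    Broadcast,     // send to all channels
    RandomShuffle, // pick a channel uniformly at random
    RoundRobin,    // rotate over the channels
    KeyBy,         // hash partitioning on a key
}

// Hash-modulo-parallelism partitioning; the benchmarks use Murmur3, while a
// standard-library hasher stands in here purely for illustration.
fn key_to_channel<K: Hash>(key: &K, parallelism: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % parallelism
}

fn main() {
    let _strategy = ChannelStrategy::KeyBy;
    for id in 1..=5u32 {
        println!("key {} -> channel {}", id, key_to_channel(&id, 4));
    }
}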
5 Results
The experiment in this section aims to investigate the suitability of Arcon’s task-
parallel approach to scheduling streaming operators. Stream processing systems
such as Apache Flink use a dedicated thread model for streaming operators as it
provides high performance. However, this approach is more restrictive as it limits
the system to the number of cores on the machine. Arcon adopts a scheduling
model that may multiplex multiple streaming operators over the same thread in
order to gain more fine-grained parallelism.
To make it easier to follow along, the streaming pipelines in this chapter will
be presented using the Flink API rather than Arc IR. The repository for the
benchmarks in this chapter is available on Github [31].
5.1 Experimental Setup
Table 5.1 shows the list of specifications related to the experiments in this thesis.
The list includes both information about the execution environment and the
versions of the frameworks that were used.
Type                 Value
Memory               16G
JVM Heap Size        2048m
Physical CPU cores   4 (8 HT)
OS                   Arch Linux 5.3.7
Processor            Intel i7-4770 CPU @ 3.40GHz
Flink                1.9.0
Rust                 1.39.0-nightly
Java                 OracleJDK build 1.8.0_221
Table 5.1: Benchmark Specs
5.2 Experiment Context
Data-parallelism in streaming dataflows is typically achieved by partitioning
the stream based on some key. Apache Flink uses the notion of key-groups
to split stream records into parallel downstream operators. The parallelism of
the operator is usually less than or equal to the number of CPU cores available. It is
rather hard to control or predict the distribution of the stream records entering
the system. It may happen that a majority of the records belong to a certain key-
group, thus overloading one partial operator. Figure 5.1 shows an example
with 4 key-groups (assuming a 4 core machine) with one of the partitions being
overwhelmed.
Figure 5.1: Skewed Partitioning
The skewed partitioning in figure 5.1 leads to underutilisation of CPU cores,
thus potentially slowing down the performance of the pipeline. One solution to
the problem could be to add more key-groups in order to distribute the partitions
more evenly. Figure 5.2 depicts how the workload could be distributed if 4 more
key-groups are added.
Figure 5.2: Balanced Partitioning
The key-group setup is now running 8 parallel operators using 4 CPU cores.
Utilising more parallel operators than available CPU cores is typically not
recommended while using a dedicated thread model due to context switching.
Arcon, on the contrary, is expected to perform better in such a setup as its work-
stealing scheduler is well suited to run more operators than cores.
Hypothesis. Arcon should outperform Flink given that a streaming pipeline
is skewed and that the number of key-group partitions exceeds the number of
CPU cores available.
The objective of the experiment is to study how both scheduling models
(Arcon/Flink) behave under a data-parallel pipeline and gain insight into the cost
and potential benefits of Arcon’s scheduling model.
5.3 Benchmark Description
The benchmark will revolve around a simple streaming pipeline consisting of the
following operators: source, map, and sink. The source node partitions data across X
map nodes. The partitioning occurs through the use of consistent hash
partitioning (i.e. hash modulo parallelism). Both Arcon and Flink use the same
Murmur3 [19] hash function in order to ensure that the partitions are equal in both
systems. The pipeline code is depicted in listing 7 and its equivalent dataflow
graph in figure 5.3.
1 case class Item(id: Int, number: Long, iterations: Int)
2 case class EnrichedItem(id: Int, root: Double)
3 val env = StreamExecutionEnvironment.getExecutionEnvironment
4 val source: DataStream[Item] = env.addSource(...)
5 source.keyBy(_.id)
6 .map(item => {
7 val root = newtonSqrt(item.number, item.iterations)
8 new EnrichedItem(item.id, root)
9 }).addSink(...).setParallelism(1)
10
11 env.execute("Benchmark Pipeline")
Listing 7: Benchmark Pipeline
Figure 5.3: Data-parallel dataflow graph
Data Generation. As seen on line 5 of listing 7, the Item object is keyed
on the id field. The data generator produces ids in the 1-50 range, with their
distribution depending on which mode is used, that is, skewed or uniform. The
second field of the Item object is a number between 1-100 that will be used for the
workload function. Lastly, the iterations field is used to control the amount of
work the function will do.
Workload Function. Each Map operator executes, per received stream element,
the square root function depicted in listing 8. This particular workload was
selected because it is fair between Rust and the JVM (i.e. plain 64-bit floating
point operations), and because the amount of work is controllable by the number of
iterations.
1 def newtonSqrt(square: Long, iters: Int): Double = {
2   val target = square.toDouble
3   var currentGuess = target
4   // standard Newton update; loop body reconstructed (original listing truncated)
5   for (_ <- 0 until iters)
6     currentGuess = (currentGuess + target / currentGuess) / 2.0
7   currentGuess
8 }
Listing 8: Newton square root function
Flink Execution Setup. The Flink setup as illustrated in figure 5.4 remains
the same during the benchmarks. The TaskManager contains 4 TaskSlots and
therefore allows a maximum parallelism of 4.
Figure 5.4: Flink Pipeline Setup
Arcon Execution Setup. In order to make the comparison with Flink fairer, both the source
and sink components will run on dedicated threads. The rest of the components
(i.e. mappers) are scheduled on top of the work-stealing scheduler. Figure 5.5
shows an example using a parallelism (p) of 8 and 4 scheduling threads (t).
Kompact’s throughput setting was set to 50 for the benchmarks in order to ensure
fairness between components. That is, each component processes a maximum of
50 messages before being rescheduled.
Figure 5.5: Arcon Pipeline Setup
5.4 Results
The results are measured by the overall execution time it takes for the sink to
finish processing the 10 million stream elements that it is expecting. The final
execution time is gathered by doing five runs and then taking the average. The
baseline number of partitions is 4 as our experimental setup has 4 physical cores,
thus allowing a maximum of 4 parallel operators in Flink. The benchmark was
executed with two types of partition distributions. The first type was generated
based on a uniform distribution with its partitions resulting in a (24.6%, 29.5%,
21.3%, 24.6%) split, whereas the second type was intentionally generated as
skewed (51.3%, 19.7%, 6.6%, 22.4%). Arcon is executed with a range of different
map parallelism/partition counts (p) and scheduling threads (t), while Flink assumes a
(p=4, t=4) setup. Three different kinds of workloads were executed: low, medium,
and high. The low workload ran with 100 Newton iterations, while medium and
high ran with 500 and 1000 iterations, respectively.
Figure 5.6 shows the results of the benchmark pipeline when running with a
uniform distribution, with the results summarised as follows:
Low Workload. Flink is 39.34% faster than Arcon’s baseline setup (p=4, t=4) and
25.21% faster than the best performing Arcon setup (i.e. p=6, t=3).
Medium Workload. Flink outperforms Arcon by 12.84% for the baseline setup
and by 9.38% for the best run (p=12,t=4).
High Workload. Arcon’s baseline is 10.21% slower than Flink during high
workload and 5.72% slower when running with a p=8,t=4 setup.
[Plot: execution time (seconds) vs. map operator workload (Low, Medium, High) for uniform data, p4=(24.6%, 29.5%, 21.3%, 24.6%), 10M elements; series: Flink p=4,t=4; Arcon p=4,t=4; p=4,t=2; p=6,t=3; p=8,t=4; p=12,t=4]
Figure 5.6: Uniform data benchmark using three different workload scenarios.
The next graph (figure 5.7) depicts the benchmark outcome when running with a
skewed data distribution. The results are summarised below:
Low Workload. Flink finishes 33.57% faster than the baseline setup (p=4,t=4) and
1.81% faster when compared to Arcon’s best performing setup (p=6,t=3).
Medium Workload. Arcon is 18.24% slower with its baseline setup but 32.29%
faster than Flink when running with p=12,t=4.
High Workload. Flink is 13.28% faster than Arcon’s baseline, but 19.54% slower
compared to Arcon (p=12,t=4).
Partition distribution. The distribution with a parallelism of 4 is depicted in
figure 5.7, while the other distributions are as follows: p=6: (17.08%, 6.56%,
32.89%, 5.26%, 17.11%, 21.07%). p=8: (31.58%, 5.25%, 5.24%, 19.73%, 10.53%,
9.22%, 17.10%, 1.31%). p=12: (5.25%, 27.62%, 11.84%, 2.62%, 2.63%, 2.63%,
5.26%, 1.30%, 5.27%, 11.85%, 9.22%, 14.46%).
[Plot: execution time (seconds) vs. map operator workload (Low, Medium, High) for skewed data, p4=(51.3%, 19.7%, 6.6%, 22.4%), 10M elements; series: Flink p=4,t=4; Arcon p=4,t=4; p=4,t=2; p=6,t=3; p=8,t=4; p=12,t=4]
Figure 5.7: Skewed data benchmark using three different workload scenarios.
5.5 Analysis
The results show that there is a noticeable scheduling overhead for Arcon when
the workload is low. Flink outperforms Arcon in the low workload scenario for
both uniform and skewed data. This is likely due to the work-stealing scheduler
not getting enough work and momentarily going to sleep. Seeing Arcon perform
better with p=4,t=2 than p=4,t=4 helps support the aforementioned assumption.
It should be noted that Arcon’s p=6,t=3 setup performs reasonably well in all runs
despite using one less mapper thread than Flink.
Another observation is that during medium and high workload scenarios, Arcon
pays 5-10% in scheduling overhead while running with uniform partitions, but
has the potential to gain up to 20-35% in performance when handling skewed
data distributions. Arcon is able to tackle the skewed data problem by creating
finer-grained partitions. Multiple partitions may be multiplexed over a single
thread in order to ensure better utilisation and therefore performance. Arcon
does not benefit much by running with an equal assignment of components to
threads (i.e. p = t), and achieves the best results with p >= t * 2. Both p=8,t=4 and p=12,t=4 are able to perform well in both data distributions.
The data supports the hypothesis laid out in section 5.2. Arcon surpasses Flink in
performance under skewed data by creating finer-grained partitions
and thus distributing the incoming data more evenly. In general, it is hard to
outperform a dedicated thread model in raw performance, as it continuously
consumes and processes stream elements without any scheduling overhead.
However, as seen in the results, it is feasible to achieve comparable performance while
also enabling a more flexible thread model. This opens up the possibility of sharing
batch and stream computation using the same scheduler.
6 Conclusions
Distributed stream processing systems ingest unbounded data and aim to
guarantee low-latency processing while providing high-throughput capabilities.
In addition, they commonly support the ability to scale out over hundreds of
commodity machines. Although there has been a significant amount of research
put into the field, there are still aspects that can be improved. Existing production-
grade stream processors lack proper compiler integration and therefore miss
out on notable low-level compiler optimisations (e.g., fusion of functions at the
instruction-level). Another observation is that the commonly used dedicated
thread model for streaming operators is not well suited for handling skewed
pipelines. It may happen that one or two parallel operators get most of the work,
thus leaving the remaining threads (cores) underutilised. This thesis presented
the design of a Rust based stream processor that addresses the aforementioned
limitations. The compiler integration occurs through Arc, a language for batch-
and-stream programming that provides both dataflow and low-level compiler
optimisations. Chapter 3 included an analysis of how an existing stream processor
(Apache Flink) could benefit from supporting more aggressive optimisations such
as those provided by Arc. The analysis served as a motivation for the design laid
out in chapter 4. Lastly, the work-stealing based scheduling model is evaluated
against Apache Flink in chapter 5. The evaluation shows promising results for
a proof-of-concept system compared to a state-of-the-art stream processor that
has seen numerous optimisations over the years. The prototype system is able
to outperform Apache Flink in skewed distributions while achieving comparable
results during uniform data distributions.
6.1 Future Work
There is plenty of interesting follow-up work to this thesis; the most notable
items are listed below:
• Low Workload Performance. The results in chapter 5 showed that Arcon had trouble competing with Flink in the low workload scenario. This needs
to be investigated. It might just be that the workers of the scheduler are not
getting enough work and therefore parking their threads temporarily.
• Advanced Benchmarks. It would be interesting to run more complex benchmark pipelines that would include parallel sources/sinks, distributed
execution, and more systems, e.g., Structured Streaming.
• State Management. Arcon does not currently support stateful processing. A natural next step would be to integrate a state backend (e.g., RocksDB)
and include state management in future evaluations.
Acronyms
BSP Bulk Synchronous Parallel.
DAG Directed Acyclic Graph.
JVM Java Virtual Machine.
References
[1] Agha, Gul. Actors: A Model of Concurrent Computation in Distributed Systems.
Cambridge, MA, USA: MIT Press, 1986. isbn: 0-262-01092-5.
[2] Akidau, Tyler et al. “The Dataflow Model: A Practical Approach to
Balancing Correctness, Latency, and Cost in Massive-scale, Unbounded,
Out-of-order Data Processing”. In: Proc. VLDB Endow. 8.12 (Aug. 2015),
pp. 1792–1803. issn: 2150-8097. doi: 10.14778/2824032.2824076. url:
http://dx.doi.org/10.14778/2824032.2824076.
[3] Akka: build concurrent, distributed, and resilient message-driven applications for
Java and Scala | Akka. https://akka.io/. (Accessed on 05/10/2019).
[4] Akka: build concurrent, distributed, and resilient message-driven applications for
Java and Scala | Akka. https://akka.io/. (Accessed on 07/23/2019).
[5] Apache Flink 1.8 Documentation: Operators. https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/.
(Accessed on 07/15/2019).
[6] Apache Hadoop. https://hadoop.apache.org/. (Accessed on 02/13/2019).
[7] Arad, Cosmin, Dowling, Jim, and Haridi, Seif. “Message-passing
Concurrency for Scalable, Stateful, Reconfigurable Middleware”. In:
Proceedings of the 13th International Middleware Conference. Middleware ’12.
Montreal, Quebec, Canada: Springer-Verlag New York, Inc., 2012, pp. 208–
228. isbn: 978-3-642-35169-3. url: http://dl.acm.org/citation.cfm?id=2442626.2442640.
[8] Armbrust, Michael et al. “Structured Streaming: A Declarative API for Real-
Time Applications in Apache Spark”. In: Proceedings of the 2018 International
Conference on Management of Data. SIGMOD ’18. Houston, TX, USA: ACM,
2018, pp. 601–613. isbn: 978-1-4503-4703-7. doi: 10.1145/3183713.3190664.
url: http://doi.acm.org/10.1145/3183713.3190664.
[9] Armstrong, Joe. Programming Erlang: Software for a Concurrent World.
Pragmatic Bookshelf, 2007. isbn: 193435600X, 9781934356005.
[10] Carbone, Paris. “Scalable and Reliable Data Stream Processing”. QC
20180823. PhD thesis. KTH, Software and Computer systems, SCS, 2018,
p. 180. isbn: 978-91-7729-901-1.
[11] Carbone, Paris et al. “Apache Flink™: Stream and Batch Processing in a
Single Engine”. In: IEEE Data Eng. Bull. 38.4 (2015), pp. 28–38.
[12] Carbone, Paris et al. “Lightweight Asynchronous Snapshots for Distributed
Dataflows”. In: CoRR abs/1506.08603 (2015). arXiv: 1506.08603. url: http://arxiv.org/abs/1506.08603.
[13] Carbone, Paris et al. “State Management in Apache Flink®: Consistent
Stateful Distributed Stream Processing”. In: Proc. VLDB Endow. 10.12 (Aug.
2017), pp. 1718–1729. issn: 2150-8097. doi: 10.14778/3137765.3137777.
url: https://doi.org/10.14778/3137765.3137777.
[14] cda-group/flink_forward_arc: Experiment used in the Flink Forward 2019 Arc
presentation. https://github.com/cda-group/flink_forward_arc.
(Accessed on 11/13/2019).
[15] Chambers, Craig et al. “FlumeJava: Easy, Efficient Data-parallel Pipelines”.
In: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language
Design and Implementation. PLDI ’10. Toronto, Ontario, Canada: ACM, 2010,
pp. 363–375. isbn: 978-1-4503-0019-3. doi: 10.1145/1806596.1806638. url:
http://doi.acm.org/10.1145/1806596.1806638.
[16] Chandramouli, Badrish et al. “Trill: A High-performance Incremental Query
Processor for Diverse Analytics”. In: Proc. VLDB Endow. 8.4 (Dec. 2014),
pp. 401–412. issn: 2150-8097. doi: 10.14778/2735496.2735503. url: http://dx.doi.org/10.14778/2735496.2735503.
[17] Chandy, K. Mani and Lamport, Leslie. “Distributed Snapshots: Determining
Global States of Distributed Systems”. In: ACM Trans. Comput. Syst. 3.1 (Feb.
1985), pp. 63–75. issn: 0734-2071. doi: 10.1145/214451.214456. url: http://doi.acm.org/10.1145/214451.214456.
[18] Dean, Jeffrey and Ghemawat, Sanjay. “MapReduce: Simplified Data
Processing on Large Clusters”. In: Proceedings of the 6th Conference on
Symposium on Opearting Systems Design & Implementation - Volume 6.
OSDI’04. San Francisco, CA: USENIX Association, 2004, pp. 10–10. url:
http://dl.acm.org/citation.cfm?id=1251254.1251264.
[19] flink/MathUtils.java at master · apache/flink. https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/util/MathUtils.java#L134. (Accessed on 11/13/2019).
[20] FLIP-21 - Improve object Copying/Reuse Mode for Streaming Runtime - Apache Flink - Apache Software Foundation. https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=71012982. (Accessed on 10/14/2019).
[21] Han, Minyang and Daudjee, Khuzaima. “Giraph Unchained: Barrierless
Asynchronous Parallel Execution in Pregel-like Graph Processing Systems”.
In: Proc. VLDB Endow. 8.9 (May 2015), pp. 950–961. issn: 2150-8097. doi:
10.14778/2777598.2777604. url: https://doi.org/10.14778/2777598.
2777604.
[22] He, Bingsheng et al. “Comet: Batched Stream Processing for Data Intensive
Distributed Computing”. In: Proceedings of the 1st ACM Symposium on Cloud
Computing. SoCC ’10. Indianapolis, Indiana, USA: ACM, 2010, pp. 63–74.
isbn: 978-1-4503-0036-0. doi: 10.1145/1807128.1807139. url: http://doi.
acm.org/10.1145/1807128.1807139.
[23] Hightower, Kelsey, Burns, Brendan, and Beda, Joe. Kubernetes: Up and Running: Dive into the Future of Infrastructure. 1st. O’Reilly Media, Inc., 2017.
isbn: 1491935677, 9781491935675.
[24] Hirzel, Martin et al. “A Catalog of Stream Processing Optimizations”. In:
ACM Comput. Surv. 46.4 (Mar. 2014), 46:1–46:34. issn: 0360-0300. doi: 10.
1145/2528412. url: http://doi.acm.org/10.1145/2528412.
[25] Koliousis, Alexandros et al. “SABER: Window-Based Hybrid Stream
Processing for Heterogeneous Architectures”. In: Proceedings of the 2016
International Conference on Management of Data. SIGMOD ’16. San Francisco,
California, USA: ACM, 2016, pp. 555–569. isbn: 978-1-4503-3531-7. doi: 10.
1145/2882903.2882906. url: http://doi.acm.org/10.1145/2882903.
2882906.
[26] Kroll, Lars. “Compile-time Safety and Runtime Performance in
Programming Frameworks for Distributed Systems”. PhD thesis. KTH,
Software and Computer systems, SCS, 2020. Unpublished.
[27] Kroll, Lars et al. “Arc: An IR for Batch and Stream Programming”. In:
Proceedings of the 17th ACM SIGPLAN International Symposium on Database
Programming Languages. DBPL 2019. Phoenix, AZ, USA: ACM, 2019, pp. 53–
58. isbn: 978-1-4503-6718-9. doi: 10.1145/3315507.3330199. url: http:
//doi.acm.org/10.1145/3315507.3330199.
[28] Kulkarni, Sanjeev et al. “Twitter Heron: Stream Processing at Scale”. In:
Proceedings of the 2015 ACM SIGMOD International Conference on Management
of Data. SIGMOD ’15. Melbourne, Victoria, Australia: ACM, 2015, pp. 239–
250. isbn: 978-1-4503-2758-9. doi: 10.1145/2723372.2742788. url: http:
//doi.acm.org/10.1145/2723372.2742788.
[29] Li, Jin et al. “Out-of-order Processing: A New Architecture for High-
performance Stream Systems”. In: Proc. VLDB Endow. 1.1 (Aug. 2008),
pp. 274–288. issn: 2150-8097. doi: 10.14778/1453856.1453890. url: http:
//dx.doi.org/10.14778/1453856.1453890.
[30] Abadi, Martín et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. url: http://tensorflow.org/.
[31] Max-Meldrum/arcon_benchmarks. https://github.com/Max-Meldrum/arcon_benchmarks. (Accessed on 11/13/2019).
[32] McKinney, Wes. “Data Structures for Statistical Computing in Python”. In:
Proceedings of the 9th Python in Science Conference. Ed. by Stéfan van der Walt
and Jarrod Millman. 2010, pp. 51–56.
[33] Meldrum, Max et al. “Arcon: Continuous and Deep Data Stream Analytics”.
In: Proceedings of Real-Time Business Intelligence and Analytics. BIRTE 2019.
Los Angeles, CA, USA: ACM, 2019, 3:1–3:3. isbn: 978-1-4503-7660-0. doi:
10.1145/3350489.3350492. url: http://doi.acm.org/10.1145/3350489.
3350492.
[34] Miao, Hongyu et al. “StreamBox: Modern Stream Processing on a Multicore
Machine”. In: 2017 USENIX Annual Technical Conference (USENIX ATC 17).
Santa Clara, CA: USENIX Association, 2017, pp. 617–629. isbn: 978-1-931971-
38-6. url: https://www.usenix.org/conference/atc17/technical-
sessions/presentation/miao.
[35] Moritz, Philipp et al. “Ray: A Distributed Framework for Emerging AI
Applications”. In: 13th USENIX Symposium on Operating Systems Design and
Implementation, OSDI 2018, Carlsbad, CA, USA, October 8-10, 2018. 2018,
pp. 561–577. url: https://www.usenix.org/conference/osdi18/presentation/nishihara.
[36] Murray, Derek G. et al. “Naiad: A Timely Dataflow System”. In: Proceedings
of the Twenty-Fourth ACM Symposium on Operating Systems Principles. SOSP
’13. Farminton, Pennsylvania: ACM, 2013, pp. 439–455. isbn: 978-1-4503-
2388-8. doi: 10.1145/2517349.2522738. url: http://doi.acm.org/10.
1145/2517349.2522738.
[37] Noghabi, Shadi A. et al. “Samza: Stateful Scalable Stream Processing at
LinkedIn”. In: Proc. VLDB Endow. 10.12 (Aug. 2017), pp. 1634–1645. issn:
2150-8097. doi: 10.14778/3137765.3137770. url: https://doi.org/10.
14778/3137765.3137770.
[38] Palkar, Shoumik et al. “Weld: A common runtime for high performance data
analytics”. In: Conference on Innovative Data Systems Research (CIDR). 2017.
[39] Schlichting, Richard D. and Schneider, Fred B. “Fail-stop Processors: An
Approach to Designing Fault-tolerant Computing Systems”. In: ACM Trans.
Comput. Syst. 1.3 (Aug. 1983), pp. 222–238. issn: 0734-2071. doi: 10.1145/
357369.357371. url: http://doi.acm.org/10.1145/357369.357371.
[40] Shvachko, Konstantin et al. “The Hadoop Distributed File System”. In:
Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and
Technologies (MSST). MSST ’10. Washington, DC, USA: IEEE Computer
Society, 2010, pp. 1–10. isbn: 978-1-4244-7152-2. doi: 10.1109/MSST.2010.
5496972. url: http://dx.doi.org/10.1109/MSST.2010.5496972.
[41] Toshniwal, Ankit et al. “Storm@Twitter”. In: Proceedings of the 2014 ACM
SIGMOD International Conference on Management of Data. SIGMOD ’14.
Snowbird, Utah, USA: ACM, 2014, pp. 147–156. isbn: 978-1-4503-2376-5.
doi: 10.1145/2588555.2595641. url: http://doi.acm.org/10.1145/
2588555.2595641.
[42] Vavilapalli, Vinod Kumar et al. “Apache Hadoop YARN: Yet Another
Resource Negotiator”. In: Proceedings of the 4th Annual Symposium on Cloud
Computing. SOCC ’13. Santa Clara, California: ACM, 2013, 5:1–5:16. isbn:
978-1-4503-2428-1. doi: 10.1145/2523616.2523633. url: http://doi.acm.
org/10.1145/2523616.2523633.
[43] Zaharia, Matei et al. “Discretized Streams: Fault-tolerant Streaming
Computation at Scale”. In: Proceedings of the Twenty-Fourth ACM Symposium
on Operating Systems Principles. SOSP ’13. Farminton, Pennsylvania: ACM,
2013, pp. 423–438. isbn: 978-1-4503-2388-8. doi: 10.1145/2517349.2522737.
url: http://doi.acm.org/10.1145/2517349.2522737.
[44] Zaharia, Matei et al. “Spark: Cluster Computing with Working Sets”. In:
Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing.
HotCloud’10. Boston, MA: USENIX Association, 2010, pp. 10–10. url: http:
//dl.acm.org/citation.cfm?id=1863103.1863113.
[45] Zeuch, Steffen et al. “Analyzing Efficient Stream Processing on Modern
Hardware”. In: Proc. VLDB Endow. 12.5 (Jan. 2019), pp. 516–530. issn: 2150-
8097. doi: 10.14778/3303753.3303758. url: https://doi.org/10.14778/
3303753.3303758.
TRITA-EECS-EX-2019:799
www.kth.se