Distributed Computing with the Data Distribution Service (DDS)

Copyright PrismTech, 2016 Angelo Corsaro, PhD CTO, ADLINK Tech. Inc. Co-Chair, OMG DDS-SIG [email protected] Distributed Computing with DDS




Distributed Systems


A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components.

— Wikipedia

Distributed System Definition


Well… this may be true at the transport level, but the components may coordinate using different models, as we'll see later.




A distributed system is a model in which components located on networked computers communicate and coordinate their actions to achieve a common goal.

Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components.

— Adapted from Wikipedia

Distributed System Definition

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.

— Leslie Lamport, 28 May 1987

Distributed System Definition


Distributed Computing/Coordination Models

Computation models are symmetric if the processes involved in the distributed computation don't assume special roles; in other terms, they are all peers.

Computational models are asymmetric if some processes assume a special role, e.g. server or client.

Symmetric vs. Asymmetric

Some distributed computation/coordination models support anonymous communication, in the sense that the communicating parties are unaware of each other.

Others require explicit knowledge of the parties with which communication has to happen.

Anonymous vs. Named

Message Passing is a symmetric computation model in which distributed processes communicate and cooperate by asynchronously sending messages to each other.

Examples: Sockets, Agents, MPI

Message Passing
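The message-passing style above can be sketched in a few lines (an illustrative Python model with per-peer mailboxes, not MPI or any specific agent framework; the `send` helper and mailbox names are inventions of the sketch):

```python
# Illustrative message-passing sketch: each peer owns a mailbox and
# communicates by asynchronously sending messages to a named peer.
import queue
import threading

mailboxes = {"A": queue.Queue(), "B": queue.Queue()}

def send(dst, msg):
    mailboxes[dst].put(msg)          # asynchronous: sender does not block

def peer_b():
    msg = mailboxes["B"].get()       # wait for an incoming message
    send("A", f"ack:{msg}")          # reply to the other peer

t = threading.Thread(target=peer_b)
t.start()
send("B", "hello")
reply = mailboxes["A"].get()
t.join()
print(reply)  # ack:hello
```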

Client/Server is an asymmetric computation model in which distributed processes communicate and cooperate by requesting services (often synchronously) from special processes called servers.

Examples: Java RMI, CORBA, OPC-UA

Client/Server
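A minimal sketch of the asymmetric pattern (plain Python sockets, purely illustrative): the server offers a service, here upper-casing a string, and the client requests it synchronously:

```python
# Illustrative client/server sketch: the client blocks on the reply,
# making the interaction synchronous.
import socket
import threading

def server(sock):
    conn, _ = sock.accept()
    req = conn.recv(1024)            # receive the request
    conn.sendall(req.upper())        # compute and send the reply
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))           # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=server, args=(srv,)).start()

cli = socket.socket()
cli.connect(("127.0.0.1", port))
cli.sendall(b"hello")
reply = cli.recv(1024)               # synchronous request/reply
cli.close()
print(reply.decode())  # HELLO
```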

Message Queues is a symmetric and anonymous computation model in which distributed processes communicate and coordinate by asynchronously putting and getting messages on named queues.

Examples: AMQP, JMS Queues, AWS SQS

Message Queues
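The defining trait, anonymity through named queues, can be sketched as follows (illustrative Python; producers and consumers only know queue names, never each other):

```python
# Illustrative message-queue sketch: processes coordinate through a
# *named* queue and never learn each other's identity.
import queue
import threading

queues = {"orders": queue.Queue()}   # named queues

def put(qname, msg):
    queues[qname].put(msg)           # asynchronous put

def get(qname):
    return queues[qname].get()       # blocking get

results = []

def consumer():
    results.append(get("orders"))    # any consumer may take the message

threads = [threading.Thread(target=consumer) for _ in range(2)]
for t in threads:
    t.start()
put("orders", "book")
put("orders", "pen")
for t in threads:
    t.join()
print(sorted(results))  # ['book', 'pen']
```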

The Tuple Space is a symmetric and anonymous computation model in which distributed processes communicate and coordinate by asynchronously reading and writing tuples, i.e. data, into a tuple space.

Examples: DBMS, DDS, Linda

Tuple Spaces
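A minimal, purely illustrative Python sketch of the Linda-style operations (write, read, take; using `None` as a wildcard in patterns is an assumption of the sketch):

```python
# Illustrative tuple-space sketch: processes coordinate by writing tuples
# into a shared space and reading/taking them by pattern matching.
space = []

def write(t):
    space.append(t)

def _match(t, pattern):
    return len(t) == len(pattern) and all(
        p is None or p == v for p, v in zip(pattern, t))

def read(pattern):                   # non-destructive lookup
    return next(t for t in space if _match(t, pattern))

def take(pattern):                   # destructive lookup
    t = read(pattern)
    space.remove(t)
    return t

write(("temp", "room-1", 21.5))
write(("temp", "room-2", 19.0))
print(read(("temp", "room-1", None)))   # ('temp', 'room-1', 21.5)
take(("temp", "room-2", None))
print(len(space))                        # 1
```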


DDS

DDS provides a Tuple Space inspired symmetric computation model in which distributed processes communicate and coordinate by asynchronously reading and writing data into an eventually consistent data space.

The DDS Model


Applications can autonomously and asynchronously read and write data, enjoying spatial and temporal decoupling.

Virtualised Data Space

[Figure: DataWriters and DataReaders exchanging data through Topics A-D, each with its QoS, in the DDS Global Data Space]


015DDS’s Data Space is

eventually consistent with respect to writes

That means that readers of some kind of data will

eventually see a write, but they may not observe it at

the “same time”

CONSISTENCYMODEL

DDS Global Data Space

...

Data Writer

Data Writer

Data Writer

Data Reader

Data Reader

Data Reader

Data Reader

Data Writer

TopicAQoS

TopicBQoS

TopicCQoS

TopicDQoS

Given a property P(t), we say that the property is eventually true iff there exists a time t* such that P(t) holds for every t ≥ t*.

Eventual Properties
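The slide's formula (a reconstruction from the definition in the text; the original symbols were lost in extraction):

```latex
\exists\, t^{*} \quad \text{such that} \quad \forall\, t \ge t^{*} : P(t)
```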

Consistency with respect to a datum means that anything/anybody looking at the datum will see exactly the same value.

Eventually Consistent means that consistency will "eventually" be asserted; but before t* (which is unknown in asynchronous and partially synchronous systems), anything/anybody looking at the datum may see different values.

Understanding Eventual Consistency
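The window of inconsistency can be sketched as follows (an illustrative Python model, not DDS itself): a write is visible at one replica immediately and only reaches the others after a propagation step:

```python
# Illustrative eventual-consistency sketch: before propagation completes,
# readers at different replicas may observe different values.
replicas = {"r1": 0, "r2": 0, "r3": 0}
log = []                                  # writes awaiting propagation

def write(value):
    log.append(value)
    replicas["r1"] = value                # the local replica sees it first

def sync():                               # anti-entropy: propagate the write
    for r in replicas:
        replicas[r] = log[-1]

write(42)
print(sorted(set(replicas.values())))  # [0, 42]  (inconsistent window)
sync()
print(set(replicas.values()))          # {42}     (eventually consistent)
```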

A Topic defines a domain-wide class of information by a <name, type, qos> triple.

DDS Topics make it possible to express both functional and non-functional properties of a system's information model.

Topic


Topic types can be expressed using different syntaxes, including IDL and ProtoBuf.

Topic Type in IDL

struct TemperatureSensor {
  @key long sid;
  float temp;
  float hum;
};

Each unique key value identifies a unique stream of data, e.g. the instances sid = "12345", sid = "54321" and sid = "15243" for the TemperatureSensor topic below.

DDS demultiplexes these "streams" and provides per-instance lifecycle information. A Writer can write multiple instances.

Instances

struct TemperatureSensor {
  @key long sid;
  float temp;
  float hum;
};
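The per-key demultiplexing can be sketched as follows (illustrative Python, with dictionaries standing in for DDS samples; `write` and `instances` are names invented for the sketch):

```python
# Illustrative sketch of instance demultiplexing: samples written on one
# topic are partitioned into per-instance streams by their key field.
from collections import defaultdict

instances = defaultdict(list)        # key value -> stream of samples

def write(sample):
    instances[sample["sid"]].append(sample)   # sid is the @key field

write({"sid": "12345", "temp": 21.5, "hum": 40.0})
write({"sid": "54321", "temp": 19.0, "hum": 55.0})
write({"sid": "12345", "temp": 21.7, "hum": 41.0})

print(len(instances))                 # 2 instances
print(len(instances["12345"]))        # 2 samples in instance "12345"
```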

Reader & Writer Caches

Each Writer and Reader has an associated Data Cache.

Data Cache

The writer's cache stores (a subset of) the data written.

Writer Cache

The reader's cache contains a projection of the global data space that reflects the reader's "interest".

Reader Cache

A Reader/Writer Cache can store the last n ∈ ℕ∞ samples for each relevant instance, where ℕ∞ = ℕ ∪ {∞}. The cache properties are configured via QoS.

Data Cache
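A KEEP_LAST(n) cache can be sketched with a bounded deque per instance (illustrative Python; `make_cache` is a name invented for the sketch, and `n=None` stands in for n = ∞, i.e. KEEP_ALL):

```python
# Illustrative sketch of a per-instance data cache holding the last n
# samples; a maxlen of None models an unbounded (KEEP_ALL) history.
from collections import defaultdict, deque

def make_cache(n=None):              # n=None -> unbounded history
    return defaultdict(lambda: deque(maxlen=n))

cache = make_cache(n=2)              # models HISTORY = KEEP_LAST(2)
for temp in (20.0, 20.5, 21.0):
    cache["sensor-1"].append(temp)   # oldest sample evicted beyond n

print(list(cache["sensor-1"]))       # [20.5, 21.0]
```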

The action of reading samples from a Reader Cache is non-destructive: samples are not removed from the cache.

Reading Samples

The action of taking samples from a Reader Cache is destructive: samples are removed from the cache.

Taking Samples
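The read/take difference can be sketched in a few lines (illustrative Python; the cache is just a list of samples):

```python
# Illustrative read vs. take sketch on a reader cache:
# read leaves samples in place, take removes them.
cache = [21.5, 21.7, 22.0]

def read(n):
    return cache[:n]                 # non-destructive

def take(n):
    taken = cache[:n]
    del cache[:n]                    # destructive: samples are removed
    return taken

print(read(2))        # [21.5, 21.7]
print(len(cache))     # 3 (read did not remove anything)
print(take(2))        # [21.5, 21.7]
print(len(cache))     # 1 (take removed the samples)
```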

Samples can be selected using composable content and status predicates.

Sample Selectors

Filters make it possible to control what gets into a DataReader cache. Filters are expressed as SQL WHERE clauses or as Java/C/JavaScript predicates.

Data Filters

Queries make it possible to control what gets out of a DataReader cache. Queries are expressed as SQL WHERE clauses or as Java/C/JavaScript predicates.

Data Queries
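The in/out distinction between filters and queries can be sketched as follows (illustrative Python, with plain predicates standing in for SQL WHERE clauses; all names are inventions of the sketch):

```python
# Illustrative filter vs. query sketch: a filter gates what *enters* the
# cache from the network; a query selects what *leaves* it toward the app.
cache = []

def on_network_sample(sample, content_filter):
    if content_filter(sample):       # filter: applied on the way in
        cache.append(sample)

def read(query):                     # query: applied on the way out
    return [s for s in cache if query(s)]

for temp in (15, 25, 35):
    on_network_sample({"temp": temp}, lambda s: s["temp"] > 18)

print(len(cache))                        # 2 (the sample 15 never entered)
print(read(lambda s: s["temp"] < 30))    # [{'temp': 25}]
```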

State-based selection makes it possible to control what gets out of a DataReader cache based on sample state (read or not), instance state (alive or not) and view state (known or not).

State Selectors

QoS policies allow the expression of, and control over, the data's temporal and availability constraints.

QoS-Enabled

QoS policies controlling end-to-end properties follow a Requested vs. Offered (RxO) model: the DataWriter declares an offered QoS, the DataReader a requested QoS, and the two are matched only if the offered QoS satisfies the requested one.

The policies involved include DURABILITY, OWNERSHIP, DEADLINE, LATENCY BUDGET, LIVELINESS, RELIABILITY and DESTINATION ORDER on the DataWriter/DataReader, plus PARTITION on the Publisher/Subscriber. DataWriters write and DataReaders read a Topic; Publishers and Subscribers, owned by a DomainParticipant, join a Domain (identified by a Domain Id) and produce in / consume from Partitions.

RxO QoS Policies
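The matching rule can be sketched for a single policy (illustrative Python; the numeric encoding of the levels is an assumption of the sketch, although the compatibility outcome shown matches the RxO idea described above):

```python
# Illustrative RxO matching sketch for RELIABILITY: a writer and reader
# match only if the offered level is at least as strong as the requested.
LEVELS = {"BEST_EFFORT": 0, "RELIABLE": 1}   # stronger = higher

def rxo_compatible(offered, requested):
    return LEVELS[offered] >= LEVELS[requested]

print(rxo_compatible("RELIABLE", "BEST_EFFORT"))   # True
print(rxo_compatible("BEST_EFFORT", "RELIABLE"))   # False
```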


Topics as Channels

We can think of a DataWriter and its matching DataReaders as connected by a logical typed communication channel whose properties are controlled by means of QoS policies.

At the two extremes, this logical communication channel can be:

- Best-Effort/Reliable Last n-values Channel
- Best-Effort/Reliable FIFO Channel

Channel Properties

The last n-values channel is useful when modelling distributed state. When n=1, the last-value channel provides a way of modelling an eventually consistent distributed state. This abstraction is very useful if what matters is the current value of a given topic instance.

The QoS policies that give a Last n-values Channel are:

- RELIABILITY = RELIABLE
- HISTORY = KEEP_LAST(n)
- DURABILITY = TRANSIENT | PERSISTENT [in most cases]

Last n-values Channel

The FIFO channel is useful when we care about every single sample produced for a given topic, as opposed to only the "last value". This abstraction is very useful when writing distributed algorithms over DDS.

Depending on QoS policies, DDS provides:

- Best-Effort/Reliable FIFO Channel
- FT-Reliable FIFO Channel (using an OpenSplice-specific extension)

The QoS policies that give a FIFO Channel are:

- RELIABILITY = RELIABLE
- HISTORY = KEEP_ALL

FIFO Channel
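The two channel extremes can be sketched with bounded vs. unbounded history (illustrative Python, not a DDS API):

```python
# Illustrative sketch: a KEEP_LAST(1) channel retains only the current
# value per instance, a KEEP_ALL channel retains every sample in order.
from collections import deque

last_value = deque(maxlen=1)         # HISTORY = KEEP_LAST(1)
fifo = deque()                       # HISTORY = KEEP_ALL

for sample in (1, 2, 3):
    last_value.append(sample)
    fifo.append(sample)

print(list(last_value))   # [3] (only the current value survives)
print(list(fifo))         # [1, 2, 3] (every sample survives, in order)
```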

We can think of a DDS Topic as defining a group whose members are the matching DataReaders and DataWriters.

DDS's dynamic discovery manages this group membership; however, it provides a low-level interface to group management and eventual consistency of views.

In addition, the group view provided by DDS makes available matched readers on the writer side and matched writers on the reader side. This is not sufficient for certain distributed algorithms.

Membership

DDS provides a built-in mechanism for detecting DataWriter faults through the LivelinessChangedStatus. A writer is considered to have lost its liveliness if it has failed to assert it within its lease period.

Fault Detection


System Model

Partially Synchronous
- After a Global Stabilisation Time (GST), communication latencies are bounded, yet the bound is unknown

Non-Byzantine Fail/Recovery
- Processes can fail and restart but don't perform malicious actions

System Model

The algorithms shown next are implemented on OpenSplice using the Moliere Scala API. All algorithms are available as part of the open-source dada project (the DDS-based Advanced Distributed Algorithms Toolkit): github.com/kydos/dada

Programming Environment


Distributed Algorithms


Group Management

A Group Management abstraction should provide the ability to join/leave a group, provide the current view, and detect failures of group members. Ideally, group management should also provide the ability to elect leaders. A Group Member should represent a process.

Group Management Abstraction

abstract class Group {
  // Join/Leave API
  def join(mid: Int)
  def leave(mid: Int)

  // Group View API
  def size: Int
  def view: List[Int]
  def waitForViewSize(n: Int)
  def waitForViewSize(n: Int, timeout: Int)

  // Leader Election API
  def leader: Option[Int]
  def proposeLeader(mid: Int, lid: Int)
}

The group management algorithm that follows provides eventually consistent views as well as eventual leaders.

Whilst eventual consistency seems to weaken the abstraction, there are plenty of situations where this is actually more than enough. It is also worth noticing that these algorithms are very efficient, thanks to the eventual-consistency assumption.

Eventually Consistent Group Views

To implement the Group abstraction with support for leader election, it is sufficient to rely on the following topic types:

Topic Types

enum TMemberStatus {
  JOINED, LEFT, FAILED, SUSPECTED
};

struct TMemberInfo {
  long mid;  // member-id
  TMemberStatus status;
};
#pragma keylist TMemberInfo mid

struct TEventualLeaderVote {
  long long epoch;
  long mid;
  long lid;  // voted leader-id
};
#pragma keylist TEventualLeaderVote mid

Group Management: the TMemberInfo topic is used to advertise presence and manage the members' state transitions.

Leader Election: the TEventualLeaderVote topic is used to cast votes for leader election.

This leads us to:

Topic(name = MemberInfo, type = TMemberInfo,
      QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
Topic(name = EventualLeaderVote, type = TEventualLeaderVote,
      QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})

Topics

Notice that we are using two last-value channels for implementing both the (eventual) group management and the (eventual) leader election. This makes it possible to:

- let DDS provide our latest known state automatically, thanks to the TransientLocal durability
- avoid periodically asserting our liveliness: DDS will do that for our DataWriter

Observation

(Eventual) Leader Election

At the beginning of each epoch the leader is None; in each new epoch a leader election algorithm is run.

[Figure: members M1, M2 and M0 join in successive epochs and one member later crashes; at each epoch the leader is re-elected: epoch 0: Leader: None => M1; epoch 1: Leader: None => M1; epoch 2: Leader: None => M0; epoch 3: Leader: None => M0]

An eventual leader election algorithm can be implemented by simply casting a vote each time there is a group epoch change. A group epoch change takes place each time there is a change in the group view.

The leader is eventually elected only if a majority of the processes currently in the view agree; otherwise the group leader is set to None.

(Eventual) Leader Election Algorithm

object EventualLeaderElection {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: GroupMember <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt

    val group = Group(gid)
    group.join(mid)

    group listen {
      case EpochChange(e) => {
        val lid = group.view.min
        group.proposeLeader(mid, lid)
      }
      case NewLeader(l) =>
        println(">> NewLeader = " + l)
    }
  }
}
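The voting rule can be sketched outside DDS as follows (illustrative Python, not the Moliere/dada API): on each epoch change every member votes for the smallest member-id in its view, and a leader is elected only when a strict majority of the current view agrees:

```python
# Illustrative sketch of the eventual leader election rule: count votes
# from members of the current view; elect only on a strict majority.
from collections import Counter

def elect(view, votes):
    """view: set of member ids; votes: {voter_mid: voted_lid}."""
    counts = Counter(lid for mid, lid in votes.items() if mid in view)
    if not counts:
        return None
    lid, n = counts.most_common(1)[0]
    return lid if n > len(view) // 2 else None   # strict majority required

view = {0, 1, 2}
votes = {0: 0, 1: 0, 2: 0}          # everyone votes for min(view) = 0
print(elect(view, votes))            # 0

votes = {0: 0, 1: 1, 2: 2}          # no majority: leader stays None
print(elect(view, votes))            # None
```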

To isolate the traffic generated by different groups, we use the group-id gid to name the partition in which all the group-related traffic will take place.

Segregating Groups

[Figure: groups "1", "2" and "3" within a DDS Domain; the partition associated with the group with gid=2 is highlighted]


Barriers

Barriers are a useful construct in parallel and distributed computing, used to coordinate the phases of a distributed computation.

Barrier Abstraction

A Barrier abstraction should provide a way to assert the desired size, along with waiting for it. It is also useful to be able to list who is waiting on a given barrier.

Barrier Abstraction

abstract class Barrier {
  def name: String
  def size: Int
  def waitingList: List[Int]

  def wait(): Unit
  def wait(timeout: Duration): Unit
}

To implement the Barrier abstraction it is sufficient to rely on the following topic types:

Topic Types

struct Barrier {
  string name;
  long long epoch;
  short count;
};
#pragma keylist Barrier name epoch

struct BarredProcess {
  string name;
  long long epoch;
  long pid;
};
#pragma keylist BarredProcess name epoch pid


Example: processes P1, P2 and P3 synchronise on barrier "Foo" (epoch 1, size 3). As each process reaches the barrier it writes a BarredProcess sample; when all three are present the barrier trips, the BarredProcess samples are taken, and a new epoch begins:

Barrier = [("Foo", 1, 3)]
BarredProcess = [("Foo", 1, 1)]
BarredProcess = [("Foo", 1, 1), ("Foo", 1, 2)]
BarredProcess = [("Foo", 1, 1), ("Foo", 1, 2), ("Foo", 1, 3)]
BarredProcess = []
Barrier = [("Foo", 2, 3)]
...
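The barrier protocol above can be sketched in-process as follows (illustrative Python, not the dada API; `arrive` is a name invented for the sketch):

```python
# Illustrative barrier sketch: each process registers on a named
# barrier/epoch; the barrier trips once the waiting list reaches the
# declared size, after which a new epoch starts.
barrier = {"name": "Foo", "epoch": 1, "count": 3}
barred = []                                   # BarredProcess samples

def arrive(pid):
    barred.append((barrier["name"], barrier["epoch"], pid))
    if len(barred) == barrier["count"]:       # everyone is here: trip it
        barred.clear()                        # take all BarredProcess samples
        barrier["epoch"] += 1                 # start the next epoch
        return True
    return False

print([arrive(pid) for pid in (2, 1, 3)])     # [False, False, True]
print(barrier["epoch"])                       # 2
```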


Wrap-up

DDS provides a computation/coordination model inspired by tuple spaces: a symmetric and anonymous model of computation in which processes coordinate by reading and writing data in an eventually consistent data space.

While amenable to very high-performance implementations, this abstraction is quite powerful and greatly eases the development of distributed systems.

Concluding Remarks