
1

Alexander A. Shvartsman

University of Connecticut, USA

Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Dynamic Distributed Systems

FRIDA 2014, Vienna, Austria

2

“Three R’s” (Lesen, Schreiben, Rechnen)

• Reading, ’Riting, and ’Rithmetic
  • Underlay much of human intellectual activity
  • Venerable foundation of computing technology
• With networking, communication became a major activity
  • Email – electronic counterpart of postal service
• Yet, it is natural to deal with reading, writing, and computing
  • In the Web, a browser app may load (i.e., read) a page, perform computation, and save (i.e., write) the results
  • In distributed databases we retrieve and store data, and rarely talk about sending and receiving data
  • Arguably, it is also easier to develop distributed algorithms with readable/writeable objects than to use message passing

3

Sharing Memory in a Networked System

Let’s place a shareable object at a node in a network
• Not fault-tolerant – single point of failure
• Not efficient – performance bottleneck
• Not very available, does not provide longevity, etc…

4


So we replicate – we’d have to anyway, since redundancy is the only means for providing fault-tolerance

5


With replication come challenges:
• How to preserve consistency while managing replicas?
• What kind of consistency?
• How to provide it?
• How to use it?


7

Consistency

• Easiest for users: a single copy view
  • Sequence of operations; a read sees all previous writes
  • Captured by atomicity [Lamport] / linearizability [Herlihy Wing]
  • Not cheap to implement even without general updates
• Cheapest to implement: a read sees a subset of prior writes
  • Not the most natural semantic for the users
• Additional complications in dynamic systems
  • Ever-changing sets of replicas and participants
  • Crashes never stop, timing variations persist
  • Evolving topology, ultimately mobility

8

Atomicity / Linearizability [Lamport / Herlihy Wing ]

“Shrink” the interval of each operation to a serialization point so that the behavior of the object is consistent with its sequential type

[Figure: timelines of read(0), write(8), and read(8) operations over Time, with each operation's interval shrunk to a serialization point]

9

More Atomicity Examples

[Figure: timelines of read( ) and write(8) operations with their ret( ) and ack( ) responses. Valid (atomic) executions include a read that returns 0 followed by a read that returns 8; invalid executions include a read that returns 8 followed by a read that returns 0.]
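The serialization-point definition can be tested mechanically on small histories. The following is a minimal sketch, not taken from the talk: it assumes a single register with initial value 0, and the tuple layout `(kind, value, start, end)` and the name `is_atomic` are illustrative choices. It tries every ordering of the operations, which is only practical for a handful of operations.

```python
from itertools import permutations

def is_atomic(history, initial=0):
    """Brute-force atomicity check for one read/write register.

    history: list of (kind, value, start, end) tuples, where kind is
    "read" or "write", value is the value returned or written, and
    start < end delimit the operation's interval (hypothetical layout).
    """
    ops = range(len(history))
    for order in permutations(ops):
        position = {op: i for i, op in enumerate(order)}
        # Serialization points must respect real time: if a ends before
        # b starts, then a has to be ordered before b.
        respects_time = all(
            position[a] < position[b]
            for a in ops for b in ops
            if history[a][3] < history[b][2]
        )
        if not respects_time:
            continue
        # Replay the candidate order against the sequential register type.
        current, legal = initial, True
        for op in order:
            kind, value, _, _ = history[op]
            if kind == "write":
                current = value
            elif value != current:
                legal = False
                break
        if legal:
            return True
    return False

# A long write(8) overlapping two sequential reads.
valid = [("write", 8, 1, 10), ("read", 0, 2, 3), ("read", 8, 4, 5)]
invalid = [("write", 8, 1, 10), ("read", 8, 2, 3), ("read", 0, 4, 5)]
```

Here `is_atomic(valid)` holds because the write can be serialized between the two reads, while `is_atomic(invalid)` fails because no placement of the write explains a read of 8 followed by a read of 0.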

10

Consistency Polemics

• Parallel / distributed systems: performance, speed-up
  • “Yes, mine is wrong… But it is fast!”
• Distributed theory focus: fault-tolerance, consistency
  • “Yes, mine is slow… But it is correct!”
• User: “I wish they reconciled…”

11

Using Majorities/Quorums for Consistency

• Consistency of replicated data: using intersecting sets
  • Starting with Gifford (79) and Thomas (79)
• Upfal and Wigderson (85)
  • Majority sets of readers and writers to emulate shared memory in a synchronous distributed setting
• Vitanyi and Awerbuch (86)
  • MW/MR registers using matrices of SW/SR registers where rows and columns are read and written
• Attiya, Bar-Noy, and Dolev (91/95)
  • Atomic SW/MR objects in message passing systems, majorities of processors, minority may crash
  • Two-phase protocol (ABD)

12

Quorum Systems and Examples

Quorum system Q over P, a set of processor ids: Q = {Q1, Q2, …}
• Qi ⊆ P
• Qi ∩ Qj ≠ Ø for all i, j

Majorities [Thomas79, Gifford79]

Matrix quorums: processor ids arranged in a matrix. A quorum: Row ∪ Column

 1  2  3  4
 5  6  7  8
 9 10 11 12
13 14 15 16

Lemma: The join of quorum system Qa over Pa and system Qb over Pb, Qa ⋈ Qb, is a quorum system over Pa ∪ Pb.
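The two defining conditions of a quorum system are easy to check by enumeration. This is a small sketch under stated assumptions: the function names are hypothetical, the matrix is filled in row-major order as in the 4 x 4 example, and the join is assumed to be the collection of all pairwise unions, which the slide does not spell out.

```python
from itertools import combinations

def is_quorum_system(quorums, universe):
    """True if every quorum is a subset of the universe and any two intersect."""
    return (all(q <= universe for q in quorums)
            and all(a & b for a in quorums for b in quorums))

def majorities(universe):
    """All smallest majorities of the given set of processor ids."""
    k = len(universe) // 2 + 1
    return [frozenset(c) for c in combinations(sorted(universe), k)]

def matrix_quorums(rows, cols):
    """One quorum per cell: the cell's whole row plus its whole column."""
    grid = [[r * cols + c + 1 for c in range(cols)] for r in range(rows)]
    return [
        frozenset(grid[r]) | frozenset(grid[i][c] for i in range(rows))
        for r in range(rows) for c in range(cols)
    ]

def join(qa, qb):
    # Assumption: the join is taken to be all pairwise unions.
    return [a | b for a in qa for b in qb]
```

For the 4 x 4 matrix each quorum has 7 members (4 in the row, 4 in the column, one shared), and any row of one quorum meets the column of another.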

13

Main Idea: (Logical) Timestamps and Quorums

• An object is represented as a pair (value, timestamp)
• A write records (new-value, new-timestamp) in a quorum
• A read obtains (value, timestamp) pairs from a quorum, then returns the value with the largest timestamp

[Figure: over Time, write(v1) records (v1, t1) in a write quorum; a read obtains (vmax, tmax) from a read quorum and returns vmax]

14

“Readers Must Write”

If operations are concurrent and a reader simply returns the latest value, then atomicity can be violated:

[Figure: Write(v1) happened to be slow and has reached one replica while the other replicas still hold v0; one read returns v1 and another read returns v0]

Solution: If readers first help the writer to record the value in a quorum, then it is safe to return the latest value
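The violation and its fix can be replayed with three replicas. This is an illustrative sketch, not the talk's code: the replica layout, the explicit quorum choices `[0, 1]` and `[1, 2]`, and the integer timestamps are all assumptions made to keep the scenario deterministic.

```python
class Replica:
    def __init__(self):
        self.tag, self.value = 0, "v0"

def read(replicas, quorum, write_back):
    # Query the quorum and pick the pair with the largest timestamp.
    tag, value = max((replicas[i].tag, replicas[i].value) for i in quorum)
    if write_back:
        # "Readers must write": record the chosen pair before returning.
        for i in quorum:
            if tag > replicas[i].tag:
                replicas[i].tag, replicas[i].value = tag, value
    return value

def scenario(write_back):
    replicas = [Replica() for _ in range(3)]
    # The slow write of v1 has reached only replica 0 so far.
    replicas[0].tag, replicas[0].value = 1, "v1"
    first = read(replicas, [0, 1], write_back)   # sees v1 at replica 0
    second = read(replicas, [1, 2], write_back)  # later, misses replica 0
    return first, second
```

Without the write-back the second read goes back in time; with it, the first read leaves v1 at replica 1, which the second read's quorum must include.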

15

Lastly: “Writers Must Read”

[Cartoon caption: “I assume you’re being facetious, Josef, I distinctly yelled ‘second’ before you did!”]

• Writers can “read” before writing (and riding) to obtain the latest timestamp in order to compute a new timestamp

16

The ABD Algorithm [Attiya Bar-Noy Dolev]

17

The Complete Algorithm [Attiya Bar-Noy Dolev]

• Replica hosts respond to Get and Put requests

• Read and write use identical two-phase communication patterns:
  • Get phase queries and obtains values from a quorum,
  • Put phase propagates values to a quorum.
• The only difference is in what is sent out in the Put phase.
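The Get/Put pattern can be sketched in a few lines. This is a single-process simulation under loud assumptions, not the authors' code: replicas are plain objects instead of network hosts, a quorum is a random majority standing in for "whichever majority answers", and the writer queries before writing (the "writers must read" variant) so that several writers can coexist. The class and method names are hypothetical.

```python
import random

class Replica:
    """Replica host: answers Get and Put requests."""
    def __init__(self):
        self.tag = (0, 0)      # (sequence number, writer id)
        self.value = None

    def get(self):
        return self.tag, self.value

    def put(self, tag, value):
        if tag > self.tag:     # keep only the newest pair
            self.tag, self.value = tag, value

class Client:
    def __init__(self, pid, replicas, seed=0):
        self.pid = pid
        self.replicas = replicas
        self.rng = random.Random(seed)

    def _quorum(self):
        # Assumption: any majority may respond; modeled as a random one.
        return self.rng.sample(self.replicas, len(self.replicas) // 2 + 1)

    def _get_phase(self):
        return max((r.get() for r in self._quorum()), key=lambda pair: pair[0])

    def _put_phase(self, tag, value):
        for r in self._quorum():
            r.put(tag, value)

    def write(self, value):
        (seq, _), _ = self._get_phase()
        self._put_phase((seq + 1, self.pid), value)   # new value, new tag

    def read(self):
        tag, value = self._get_phase()
        self._put_phase(tag, value)                   # same pair, written back
        return value
```

The two operations differ only in the pair handed to the Put phase, which mirrors the last bullet above.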

18

New Frontiers: Dynamic Atomic Memory

Goal: Develop and analyze algorithms to solve problems of coherent data sharing in dynamic distributed environments

“Dynamic” encompasses
• Changing sets of participants: nodes come and go as they please
• Wide range of failures
• Asynchrony, timing variations
• Crashes, message loss, weak delivery guarantees
• Changes in network topology
• Processor mobility

19

Approach: Dynamic Quorums [Lynch Shvartsman]

• Objects are replicated at several network locations
• To accommodate small, transient changes:
  • Use quorum configurations: members, quorums
  • Maintains atomicity during “normal operation”
  • Allows concurrency
• To handle larger, more permanent changes:
  • Reconfigure
  • Maintains atomicity across configuration changes
  • Any configuration can be installed at any time
  • Reconfigure concurrently with reads/writes -- no heavy-handed view change

20

Algorithmic Ideas

• Reconfigurable quorum systems
  • Quorums maintain consistency during modest and transient changes
  • Reconfigurations accommodate more drastic and permanent changes
• Read/write operations are frequent
  • Use quorum access and allow concurrency
  • Isolate from reconfiguration
• Reconfigurations are infrequent
  • Use consensus to impose total order (Paxos)
  • Optimistic dissemination without formal installation
  • Conservative garbage collection of obsolete configurations

21

Related Other Approaches

• Using consensus to agree on each operation [Lamport]
  • Performance overhead for each read and write op
  • Termination of operations depends on consensus
• Use group communication service [Birman 87] with TO bcast [Amir, Dolev, Melliar-Smith, Moser 94], [Keidar, Dolev 96]
  • View change delays reads/writes
  • One change may trigger view formation
• DynaStore: Reconfiguration without consensus [Aguilera, Keidar, Malkhi, Shraer 11]
  • Initial quorum system, incremental adds/removes
  • Changes yield DAGs of possibilities
  • Reads/writes use ABD-like phases, traverse DAGs
  • Assuming finite reconfigurations guarantees termination

22

GCS: Group Communication Services (a Detour)

• Pioneering work of Ken Birman & Co. made GCSs one of the most important distributed system building blocks
  • GCS: group membership + multicast communication
  • Isis, Transis, Consul, Ensemble, Totem, Horus, Newtop, Relacs …
• Commercial variants of Birman’s Isis used in
  • French air traffic control system
  • Stock exchange on Wall Street
  • Swiss Electronic Bourse
• GCSs for years defied sufficiently formal specification that would both
  • Detail the utility of the services and
  • Enable compelling reasoning about correctness

23

View-Synchrony [Fekete Lynch Shvartsman 01]

A new, simple formal specification for a partitionable view-oriented group communication service.

To demonstrate the value of our specification, we used it to construct an algorithm for an ordered-broadcast application that reconciles information derived from different views. Atomic memory is now easily obtained

Our algorithm is based on algorithms of Amir, Dolev, Keidar, Melliar-Smith and Moser.

Our specification has a simple implementation, based on the membership algorithm of Cristian and Schmuck.

24

View-Synchrony [Fekete Lynch Shvartsman 01]

We proved the correctness and analyzed the performance and fault-tolerance of the algorithm

• Our work on GCS was (we believe) the first successful formal specification + rigorous proofs
  • Safety: Invariants and simulation
  • Conditional performance analysis
• Myla Archer @ NRL checked several invariants using PVS – invaluable contribution to perfecting the proofs
• Despite perhaps only modest algorithmic advances the paper was/is highly cited -- the value is in the proofs!

25

Our Ground-Up Approach: RAMBO

Reconfigurable Atomic Memory for Basic Objects [Lynch, Shvartsman 02], [Gilbert, Lynch, Shvartsman 10]
• For small, transient changes, use quorum configurations
  • Maintain atomicity and allow concurrency, a la ABD
• For larger, permanent changes, reconfigure:
  • Dynamically install any quorum system
  • Guarantee atomicity across configurations
• Guarantees:
  • Linearizability in all cases
  • Good latency in common cases
  • High availability
• Everything proceeds concurrently
• Use low-level gossip communication

26

Methodology

• Specify algorithm
  • Interacting state machines
  • Using non-deterministic “gossip”
• Show correctness/safety for arbitrary patterns of asynchrony, assuming arbitrary crash-failures and message loss
• Analyze performance for a subset of timed executions
  • Bounded message delay, 0-time local processing
  • Some “gossip” becomes deliberate, some periodic
  • Non-failure of certain quorums for certain periods
  • Reason about operation latency
  • Of course none of this impacts safety

27

Reconfigurable Atomic Memory for Basic Objects

• Global service specification
• Implementation: Main algorithm + “recon” service
• Recon service: “Advance reconnaissance”
  • Consistent sequence of configurations
  • Loosely coupled
• Main algorithm:
  • Reading, writing
  • Receives, disseminates new configuration information; no explicit installation
  • Reconfigures: upgrade to new and remove old
  • Reads/writes may use several configurations

[Figure: RAMBO components and Recon connected through Net]

28

Architectural View

[Figure: Application on top of RAMBO; Node i and Node j each run Joiner and R/W components (Joineri, R/Wi, Joinerj, R/Wj) alongside Recon, over the Communication Network]

29

High-Level Functions

• Joiner
  • Introduces new participants to the system
• Reader-Writer
  • Routine read and write operations
  • Two-phased algorithm using all “known” configurations
  • Using tags
• Recon
  • Chooses new next configuration, e.g., using Paxos
  • Informs members of the previous configuration
• Configuration upgrade (“packaged” with Reader-Writer)
  • Identify and remove obsolete configurations (garbage collection)

30

Configurations and Reconfiguration

• Configuration: quorum system
  • Collection of subsets of replica host ids where any two subsets intersect
  • Or collection of read-quorums and write-quorums, where any read-quorum intersects with any write-quorum
• Reconfiguration process involves two decoupled steps
  • Recon: Emitting a new configuration, then, at some later time
  • Garbage-collecting obsolete configurations while “upgrading” to the latest (locally) known configuration
• No constraints on the membership of distinct quorum systems

[Figure: sequence of quorum systems Q1, Q2, Q3, …]

31

Implementation of Recon

[Figure: Recon implemented using Consensus, over Net]

• Uses distributed consensus to determine new configurations 1, 2, 3, …
  • Note: when the universe of configurations is finite and known, then consensus is not needed even with unbounded reconfiguration [GeoQuorums]
• Members of old configuration may propose a new configuration
• Proposals reconciled using consensus
• Consensus is a fairly heavy mechanism, but it is
  • Used only for reconfigurations, which are infrequent
  • Does not delay Read/Write operations

32

Configurations and Config Maps

• Configuration c
  • members(c) -- set of members of configuration c
  • read-quorums(c), write-quorums(c) -- sets of quorums
• Configuration map cm: mapping from naturals to configurations
  • cm(k) is configuration k
  • Can be defined (c), undefined (⊥), garbage-collected (±)

[Figure: a configuration map as a sequence of entries: a garbage-collected prefix (± ±, GC’d), then defined entries (c c …, Defined), then a Mixed region, then Undefined entries]
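A configuration map can be modeled as a list indexed by configuration number. The sketch below is an assumption-laden illustration, not the talk's code: it assumes that an entry only ever moves forward, from undefined to defined to garbage-collected, and that two nodes never hold different configurations at the same index (Recon emits a consistent sequence). The `merge` helper, which combines a local map with one learned through gossip, is a hypothetical name.

```python
UNDEFINED = "⊥"   # no configuration known yet at this index
GC = "±"          # configuration was garbage-collected

def merge_entry(a, b):
    # Assumed information order: undefined < defined < garbage-collected.
    if a == GC or b == GC:
        return GC
    return a if a != UNDEFINED else b

def merge(cm1, cm2):
    """Pointwise merge of two configuration maps given as lists."""
    n = max(len(cm1), len(cm2))

    def pad(cm):
        # Indices beyond the end of a list are treated as undefined.
        return cm + [UNDEFINED] * (n - len(cm))

    return [merge_entry(a, b) for a, b in zip(pad(cm1), pad(cm2))]
```

For example, a node that has already garbage-collected configuration 0 and hears about configuration 2 from a peer ends up with `["±", "c1", "c2"]`.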

33

Configuration Upgrade [Gilbert, Lynch, S 10]

Reconfigure to last configuration in a contiguous segment
• Phase 1:
  • Informs write-quorum of cj … ck-1 about ck
  • Collects (value, tag) from read-quorums of cj … ck-1
• Phase 2: Propagates latest (value, tag) to a write-quorum of ck
• Garbage-collect: Set cmap(j … k-1) to ±
• Constant-time upgrade regardless of the number of obsolete configurations (conditioned on failures)
• Maintains good read/write latency during system instability or frequent reconfigurations

[Figure: configuration map ± … cj … ck …]
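The two phases and the garbage-collection step can be traced on in-memory state. This is a sequential sketch under several assumptions that are not in the slides: each configuration carries one representative read-quorum and one write-quorum, tags are plain integers, replica state lives in the dictionaries `store` and `known`, and the defined entries of `cmap` form one contiguous segment. In the real algorithm all old configurations are contacted in parallel, which is what makes the upgrade constant-time.

```python
from dataclasses import dataclass

GC = "±"

@dataclass
class Config:
    name: str
    read_quorum: frozenset    # one representative read-quorum (assumption)
    write_quorum: frozenset   # one representative write-quorum (assumption)

def upgrade(cmap, store, known):
    """cmap: list whose entries are Config, GC, or None (undefined).
    store: replica id -> (tag, value); known: replica id -> set of names."""
    defined = [i for i, c in enumerate(cmap) if isinstance(c, Config)]
    j, k = defined[0], defined[-1]
    target = cmap[k]
    old = [cmap[i] for i in range(j, k)]

    # Phase 1: tell old write-quorums about ck, collect from old read-quorums.
    best = (0, None)
    for c in old:
        for rid in c.write_quorum:
            known[rid].add(target.name)
        best = max([best] + [store[rid] for rid in c.read_quorum],
                   key=lambda pair: pair[0])

    # Phase 2: propagate the latest (tag, value) to a write-quorum of ck.
    for rid in target.write_quorum:
        if best[0] > store[rid][0]:
            store[rid] = best

    # Garbage-collect cj ... ck-1.
    for i in range(j, k):
        cmap[i] = GC
    return cmap
```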

34

Configuration Map Changes (Local View)

[Local view of the configuration map as TIME advances, top to bottom:]

c0
c0 c1
c0 c1 c2 . . . ck
± c1 c2 . . . ck
± ± c2 . . . ck
± ± ± c3 . . . ck
. . .
± ± ± ± ± c c c c . . .

35

On to Reads and Writes: Values and Tags

• Each value v has an associated tag t (logical timestamp)
  • Tag is made up of the sequence-processor pair
• Reads: a set of value-tag pairs is obtained
  • the result is the value with the maximum tag
• Writes: a set of value-tag pairs is obtained
  • new-value is propagated with a new-tag that is a lexicographic increment of tag:
    new-tag := (tag.seq + 1, pid)
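Tags compare lexicographically, first by sequence number and then by processor id, which is exactly how Python compares tuples. A minimal sketch, assuming integer processor ids; `Tag`, `read_result`, and `next_tag` are illustrative names, not identifiers from the talk.

```python
from typing import NamedTuple

class Tag(NamedTuple):
    seq: int   # sequence number
    pid: int   # id of the writing processor, breaks ties

def read_result(pairs):
    """pairs: iterable of (tag, value); a read returns the max-tag value."""
    return max(pairs, key=lambda pair: pair[0])[1]

def next_tag(observed, pid):
    """Lexicographic increment of the largest tag a writer has observed."""
    latest = max(observed)
    return Tag(latest.seq + 1, pid)
```

Because the processor id is part of the tag, two writers that observe the same maximum still produce distinct, totally ordered tags.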

36

Dynamic Reader-Writer and Recon

• The work is split between Reader-Writer and Recon
  • Recon emits consistent configurations
  • Reader-Writer processes run two-phase quorum-based algorithm, with all “active” configurations
  • Background “gossip” builds fixed-points
• If Recon emits new configuration, Reader-Writer continues reads/writes in progress, until fixed-point is reached

[Figure: R/Wi, Recon, and R/Wj connected through Net]

37

Processing Reads and Writes

Reads and Writes perform Query and Propagation phases using known configurations, stored in op.cmap.
• Query: Gets “fresh” value, tag, and cmap information from read-quorums
• Propagation: Gives latest (value, tag) to write-quorums
• Both phases: Extend op.cmap with newly-discovered configurations that now must also be involved.
• Each phase ends with a fixed point, involving all the configurations currently in op.cmap

[Figure: a Read or Write operation as a Query Phase (StartQuery to EndQuery) followed by a Propagation Phase (StartProp to EndProp)]
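The fixed-point condition that ends a phase can be written as a predicate over the set of nodes that have responded so far. This is a hedged sketch: the dictionary layout of a configuration and the name `phase_complete` are assumptions, and the real algorithm also checks that the responses are fresh enough for the current phase.

```python
def phase_complete(op_cmap, responders, kind):
    """True when, for every configuration in op_cmap, some quorum of the
    required kind lies entirely within the set of responders.

    kind is "read" for the Query phase and "write" for the Propagation
    phase; each configuration is a dict {"read": [...], "write": [...]}
    of frozensets (hypothetical layout).
    """
    for config in op_cmap:
        if not any(quorum <= responders for quorum in config[kind]):
            return False
    return True

c0 = {"read": [frozenset({1, 2}), frozenset({2, 3})],
      "write": [frozenset({1, 2, 3})]}
c1 = {"read": [frozenset({4})],
      "write": [frozenset({4, 5})]}
```

If a response reveals a new configuration, it is appended to `op_cmap` and the predicate becomes false again until a quorum of that configuration has also answered.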

38

Methodology

• Algorithms are presented formally, using interacting state machine models (Input/Output automata):
  • service specifications
  • algorithm descriptions
  • models for applications
• Safety: rigorous proof of correctness (atomicity) for arbitrary patterns of asynchrony and change
• Conditional performance analysis
  • E.g., when message latency < d and quorum configurations are “viable”, then read and write operations take time between 4d and 8d, under reasonable “steady-state” assumptions.

39

Input Output Automata (IOA)

Input Output Automata [Lynch & Tuttle]
• A language and formalism based on simple mathematical foundation
• Elegant, suitable for analysis, well documented
• Supports composition, abstraction
• Model system components as a collection of interacting state machines
• Allows rigorous reasoning about system properties
• Widely used in research – hundreds of algorithms

40

Example Spec’n: Asynchronous Lossy Channel

Channeli,j
• Input: sendi,j(m)
• Output: recvj,i(m)
• Internal: lose(m)
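The three actions of the channel automaton map naturally onto methods of a small state machine. This is a rough Python rendering under stated assumptions, not IOA or Tempo code: the state is a multiset of in-transit messages, delivery order is arbitrary, and the non-deterministic choice of which message to deliver or lose is modeled with a seeded random generator.

```python
import random

class LossyChannel:
    """Asynchronous lossy channel from node i to node j."""

    def __init__(self, seed=0):
        self.in_transit = []              # messages sent but not yet handled
        self.rng = random.Random(seed)    # stands in for non-determinism

    def send(self, m):
        # Input action: always enabled, just records the message.
        self.in_transit.append(m)

    def enabled(self):
        # recv and lose are enabled only while some message is in transit.
        return bool(self.in_transit)

    def recv(self):
        # Output action: deliver an arbitrary in-transit message.
        return self.in_transit.pop(self.rng.randrange(len(self.in_transit)))

    def lose(self):
        # Internal action: silently drop an arbitrary in-transit message.
        self.in_transit.pop(self.rng.randrange(len(self.in_transit)))
```

Nothing forces `recv` to ever fire, so a safety proof over this model cannot rely on any particular message arriving.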

41

Details: Reader-Writer: Send and Recv Code

Specification of gossip using Input/Output Automata of [Lynch Tuttle]

Send

Receive

42

Details: Reader-Writer Fixed Points

Phase 1 fixed point

Specification of fixed points using Input/Output Automata

Phase 2 fixed point

43

Showing Read/Write Atomicity

We show atomicity using a partial order
• Let β be any sequence of reads and writes
• Let ≺ be an irreflexive partial order of all operations in β
• To prove atomicity, show that ≺ satisfies:
  • For any operation π, there are only finitely many operations φ such that φ ≺ π
  • If π precedes φ in β, then not φ ≺ π
  • If π is a write and φ is any operation, then either π ≺ φ or φ ≺ π
  • Any read returns the value written by the last preceding write, per ≺ (or the initial value if there is no such write) [Lynch, Lemma 13.16]
• Rigorous proof that in RAMBO the max-tag of reads and new-tag of writes, lexicographically ordered, provide such a partial order

44

Some Latency Analysis Details

• Restrict attention to a subset of timed executions
• Reminder: Read and write operations are not affected by Recon delays or non-termination
• Configuration upgrade (garbage collection) takes 4d
• If reconfigurations are “rare” -- operations take 4d
• If configurations are in “steady state” -- operations take 8d
• Logarithmic time “catch-up” after a bursty period of “bad timing behavior”
• A recovering node joins quickly after a long absence

45

Implementation

• Experimental system implementations [Musial 07]
  • Platform for refinement, optimization, tuning
  • Observe behavior of algorithms in a local area setting
  • Cluster with 16 +/- Linux machines & fast switch
• Developed by manually translating the Input/Output Automata specification to Java code
  • Precise rules are followed to mitigate error introduction during translation
  • Rigorous proofs [Georgiou, Musial, Sh., Sonderegger 07, 11]
• Next steps:
  • Specification in Tempo [Lynch Michel S 08] (Timed IOA)
  • Code generation ([Georgiou Lynch Mavrommatis Tauber 09])

46

Optimizations – Rigorously Proved

• Parallel upgrade [Gilbert Lynch S 10]
• Improving communication efficiency [Georgiou, Musial, S 06]
  • Explicit “leave” service
  • Incremental gossip for long-lived data
• Improving liveness & gossip efficiency [Gramoli, Musial, S 05]
  • Non-deterministic heuristic for prodding slow operations
  • Constrained gossip patterns
• Providing complete shared memory [Georgiou, Musial, S 05]
  • Domain-oriented Rambo
  • Optimizing performance for groups of related objects

47

Optimization and Development Methodology

[Figure: development chain. Atomicity properties are established by Proof for Abstract Rambo [Lynch, S 02] [Gilbert, L, S 10]; Rambo with graceful departures and Long-Lived Rambo [Georgiou, Musial, S 04, 06] are linked to it by Simulation; a Running System is obtained by Derivation, either manual derivation [Musial 07] or semi-automated code generation [Georgiou + 09, 10].]

48

Forward Simulation

Let C (low-level, concrete) and A (high-level, abstract) be two systems. The goal is to show that “C implements A” by “trace inclusion”
• An execution is a finite or infinite alternating sequence s0 a1 s1 a2 s2 … of states and actions, where s0 is a start state
• A trace is obtained by removing all states and internal actions
• Forward simulation from C to A: f ⊆ states(C) × states(A)
  • If s ∈ start(C) then f(s) ∈ start(A)
  • If (s, a, s’) ∈ transitions(C), then there exists an execution fragment α = f(s), …, f(s’) such that trace(a) = trace(α)
• Theorem: If there exists a simulation relation from C to A, then traces(C) ⊆ traces(A).
• This implies that any safety property of A is also satisfied by C

49

Simulation (or Abstraction) Function

Simulation function F maps each state of specification C (concrete) to a state of specification A (abstract)

F is required to satisfy the following two conditions:
1. If s is an initial state of C then F(s) is an initial state of A
2. If s and F(s) are reachable states of C and A respectively, and (s, p, s') is a step of C, then there is a step p’ of A from F(s) to F(s'), having the same trace

[Figure: commuting diagram. Step p takes s to s' in C, step p’ takes F(s) to F(s') in A, and F maps s to F(s) and s' to F(s').]
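For small finite automata the two conditions on F can be checked exhaustively. The sketch below makes simplifying assumptions that the talk does not: every action is external, so "same trace" reduces to "same action label", and each step of C is matched by exactly one step of A (no stuttering or multi-step fragments). The dictionary layout and the counter/parity example are hypothetical.

```python
def is_simulation_function(F, C, A):
    """F: dict from states of C to states of A. C and A are dicts with
    "start" (set of states) and "steps" (set of (state, action, state))."""
    # Condition 1: start states map to start states.
    if not all(F[s] in A["start"] for s in C["start"]):
        return False
    # Compute the reachable states of C.
    reachable, frontier = set(C["start"]), list(C["start"])
    while frontier:
        s = frontier.pop()
        for (u, _, v) in C["steps"]:
            if u == s and v not in reachable:
                reachable.add(v)
                frontier.append(v)
    # Condition 2: each reachable step of C has a matching step of A.
    return all((F[s], a, F[t]) in A["steps"]
               for (s, a, t) in C["steps"] if s in reachable)

# Concrete: a counter modulo 4. Abstract: its parity.
counter = {"start": {0},
           "steps": {(i, "tick", (i + 1) % 4) for i in range(4)}}
parity = {"start": {0},
          "steps": {(0, "tick", 1), (1, "tick", 0)}}
F = {i: i % 2 for i in range(4)}
```

With these definitions `is_simulation_function(F, counter, parity)` holds, whereas a map sending every counter state to 0 fails condition 2 because the parity automaton has no "tick" step from 0 to 0.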

50

Optimization: Improving performance

Long-Lived RAMBO: Graceful Leave + Incremental Gossip [Georgiou, Musial, Shvartsman 06]
• Rigorous proof of correctness by simulation
• Performance study

51

Complete Shared Memory

• Atomicity is compositional
• Naïve approach:
  • Implement a singleton memory location
  • Get a complete shared memory by running several implementations
• Not so fast! Literally.

52

Complete Shared Memory

Domain-oriented reconfigurable atomic memory Optimizing performance for groups of related objects

- Composition

- Domain

53

Atila: Atomicity through Indirect Learning

• Abstract Rambo assumes all-to-all connectivity
• Atila: “Atomicity Through Indirect Learning Algorithm”
  • Indirect learning enables progress without any routing protocol or complete connectivity [Konwar, Musial, Nicolaou, S 07]
• Prove atomicity guarantees
• Conditional analysis of operation latency
• Identify deployment settings that lead to reasonable communication cost

54

RDS: Rambo Paxos

Reconfigurable Distributed Storage for Dynamic Networks: RDS [Chockler, Gilbert, Gramoli, Musial, Shvartsman 09]
• How can liveness of Rambo get hurt?
  • Reconfiguration bursts and
  • Rapid rate of failures
• Solution:
  • Paxos-based procedure for installing new configurations
  • Integrate configuration upgrade with installation
  • Overlap phases of Paxos with “garbage collection”
  • Obsolete configurations are removed more quickly
  • Reads and writes use only 1 or 2 configurations

55

Federated Array of Bricks (FAB)

• Storage system developed and evaluated at HP Labs [Saito Frølund Veitch Merchant Spence 05]
• Distributes workload and handles failures and recoveries without disturbing client requests
• Read or write protocol involves majority quorums of storage “bricks” following the Rambo algorithm
• Evaluations of the implementation showed
  • FAB performance is similar to centralized solutions,
  • while offering at the same time continuous service and high availability

56

DynaDisk Implementation

• Data-center read/write storage system
• Allows add/remove of storage devices on-the-fly
• Based on DynaStore, but with and without consensus [Shraer Martin Malkhi Keidar 10]

57

GeoQuorums

Dynamic atomic read/write memory for mobile settings [Dolev, Gilbert, Lynch, Shvartsman, Welch 04, 05]
• Use Rambo architecture over Virtual Node layer
• Nodes: fixed geographical locations called Focal Points
  • Centers of populated, compact geographical areas: traffic intersections, buildings, bridges, points-of-interest
  • Continuously populated, thus able to maintain state
• Implementations:
  • Virtual Node layer over the physical mobile network
  • Atomic read/write memory over the Virtual Node layer

58

GeoQuorums

• Mobile nodes
• Focal points – implemented as Virtual Nodes
• Quorums are defined over focal points
• Use GPS as timestamps
• Fast(er) read/write operations
  • Single phase writes
  • One or two phase reads
• Simplified, consensus-free, reconfiguration
  • Two-phase algorithm using fixed configurations
  • Can be motivated by performance: e.g., if writes are frequent, install smaller write quorums

59

Read-Modify-Write Storage Systems

• RMW is strictly stronger than atomic read and write
• Some storage systems implement atomic RMW operations. This type of consistency is expensive, and requires at its core atomic updates
  • Reduce parts of the system to a single-writer model (e.g., Microsoft’s Azure)
  • Depend on clock synchronization hardware (Google’s Spanner)
  • Rely on complex mechanisms for resolving event ordering such as vector clocks (Amazon’s Dynamo)

60

Tempo @ www.veromodo.com

• A toolkit for the Timed Input/Output Automata formalism
• Computer-aided design framework
  • Specification language
  • Suite of tools used to specify systems and reason about them (interfaces to UPPAAL and PVS)
  • Eclipse GUI (and alternatively a command line interface)
• Tempo is used to
  • Specify distributed algorithms / protocols
  • Verify their correctness / safety with theorem provers / model checkers
  • Simulate their behavior (native simulator)
• Tempo availability: beta at www.veromodo.com

61

Tempo – bird’s eye view

[Figure: Tempo tool chain. Vocabularies and Automata files go through the Parser (Syntax Error Reports) to a Syntax Tree, Semantic Analysis, and an Internal representation; a Back-end Plugin loader with Back-end libraries feeds the Simulator (interactive!), Uppaal Model checker, Probabilistic Model checker, PVS Generation, Code Generation, and Deployment Optimization. The legend distinguishes existing from proposed back-end tools.]

62

Tempo Availability (beta for research use)

• Pure Java

• Runs on the Eclipse Rich Client Platform

• Runs on multiple platforms

• Mac OS X

• Windows

• Linux

63

Problems & Futures

• Use Tempo (www.veromodo.com)
  • Complete specifications
  • Code generation
• “Weaker” consistency – crafting atomicity?
  • Local optimizations that preserve global atomicity
• Practical settings
  • Recoveries, reconfiguration, and self-stabilization
  • Indirect learning in mobile settings
• Reconfiguration
  • When to reconfigure?
  • How to choose configurations?
  • Optimization of deployment [Sonderegger 11]
• Going beyond read/write objects

64

Thank You!

Questions and Discussion

?

65

Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Distributed Systems

Consistent shareable data services supporting atomic (linearizable) objects provide convenient building blocks for dynamic distributed systems. In general it is challenging to combine the guarantee of linearizability with efficiency in practical systems subject to perturbations in the computing medium. We describe our work on specification and implementation of consistent data services for network settings. We present a general framework that can be tailored to yield services for various target network settings. In these services, long-lived data availability is ensured through replication, while atomicity is ensured by performing reads and writes using evolvable quorum configurations. We describe on-the-fly reconfiguration that only modestly interferes with on-going operations; in particular, the termination of reads and writes does not depend on the termination of reconfiguration. Safety (atomicity) holds for arbitrary patterns of asynchrony, processor crashes, and link failures. We describe specific examples of specification, rigorous reasoning about correctness, provable optimizations, and methodical implementations of consistent data services in distributed systems.
