Data-Centric Reconfiguration with Network-Attached Disks

Alex Shraer (Technion). Joint work with: J.P. Martin, D. Malkhi, M. K. Aguilera (MSR) and I. Keidar (Technion)


TRANSCRIPT

Page 1: Data-Centric Reconfiguration with Network-Attached Disks

Data-Centric Reconfiguration with Network-Attached Disks

Alex Shraer (Technion)

Joint work with: J.P. Martin, D. Malkhi, M. K. Aguilera (MSR) I. Keidar (Technion)

Page 2: Data-Centric Reconfiguration with Network-Attached Disks

Preview


• The setting: data-centric replicated storage
  – Simple network-attached storage-nodes

• Our contributions:
  1. First distributed reconfigurable R/W storage – allows adding/removing storage-nodes dynamically
  2. Asynchronous vs. consensus-based reconfiguration

Page 3: Data-Centric Reconfiguration with Network-Attached Disks

Enterprise Storage Systems

• Highly reliable customized hardware

• Controllers, I/O ports may become a bottleneck

• Expensive

• Usually not extensible
  – Different solutions for different scale
  – Example (HP): high end – XP (1152 disks), mid range – EVA (324 disks)


Page 4: Data-Centric Reconfiguration with Network-Attached Disks

Alternative – Distributed Storage

• Made up of many storage nodes
• Unreliable, cheap hardware
• Failures are the norm, not an exception

• Challenges:
  – Achieving reliability and consistency
  – Supporting reconfigurations


Page 5: Data-Centric Reconfiguration with Network-Attached Disks

Distributed Storage Architecture

• Unpredictable network delays (asynchrony)

[Figure: dynamic, fault-prone storage clients issue read/write requests over a LAN/WAN to fault-prone storage nodes (cloud storage)]

Page 6: Data-Centric Reconfiguration with Network-Attached Disks

A Case for Data-Centric Replication

• Client-side code runs replication logic (a toy sketch follows the figure below)
  – Communicates with multiple storage nodes
• Simple storage nodes (servers)
  – Can be network-attached disks
      Not necessarily PCs with disks
      Do not run application-specific code
      Less fault-prone components
  – Simply respond to client requests → high throughput
  – Do not communicate with each other
      If storage-nodes communicate, their failure is likely to be correlated!
      Oblivious to where other replicas of each object are stored
      Scalable: the same storage node can be used for many replication sets

[Figure: a not-so-thin client talks directly to thin storage nodes]
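As a rough illustration of this division of labor, here is a toy, in-process Go sketch (my own, not DynaDisk code): the client runs the replication logic and waits for a majority of acknowledgements, while each storage node merely stores the highest-timestamped value it has seen and replies.

```go
package main

import "fmt"

// storageNode is a deliberately "thin" server: it only keeps the
// highest-timestamped value it has seen and acknowledges requests.
type storageNode struct {
	ts  int
	val string
}

func (n *storageNode) write(ts int, val string) bool {
	if ts > n.ts {
		n.ts, n.val = ts, val
	}
	return true // acknowledge
}

// clientWrite is the client-side replication logic: send the timestamped
// value to the nodes of the configuration and return once a majority has
// acknowledged. (Sketch only; the real protocol is ABD-like, with a read
// phase to choose timestamps and asynchronous messaging.)
func clientWrite(config []*storageNode, ts int, val string) {
	acks := 0
	for _, n := range config {
		if n.write(ts, val) {
			acks++
		}
		if acks > len(config)/2 {
			return // majority reached; remaining nodes may be slow or down
		}
	}
}

func main() {
	config := []*storageNode{{}, {}, {}}
	clientWrite(config, 1, "Italy")
	clientWrite(config, 2, "Spain")
	fmt.Println("node A now stores:", config[0].val) // "Spain"
}
```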

Page 7: Data-Centric Reconfiguration with Network-Attached Disks

Real Systems Are Dynamic


The challenge: maintain consistency, reliability, availability

[Figure: storage nodes A–E behind a LAN/WAN; one client issues reconfig {–A, –B} while another issues reconfig {–C, +F, …, +I}]

Page 8: Data-Centric Reconfiguration with Network-Attached Disks

Pitfall of Naïve Reconfiguration


[Figure: nodes A–D each initially store the configuration {A, B, C, D}; one client runs reconfig {+E} and installs {A, B, C, D, E} at some nodes, while another concurrently runs reconfig {–D} and installs {A, B, C} at others; several messages are delayed, so different nodes end up believing in different current configurations]

Page 9: Data-Centric Reconfiguration with Network-Attached Disks

Pitfall of Naïve Reconfiguration (cont.)

[Figure: a writer using configuration {A, B, C, D, E} writes x = "Spain" with timestamp 2 to a majority of its configuration, while a reader using configuration {A, B, C} contacts a majority of its own configuration, sees only x = "Italy" with timestamp 1, and returns "Italy". Split Brain!]

Page 10: Data-Centric Reconfiguration with Network-Attached Disks

Reconfiguration Option 1: Centralized

• Can be automatic
  – E.g., Ursa Minor [Abd-El-Malek et al., FAST '05]
• Downtime
  – Most solutions stop R/W while reconfiguring
• Single point of failure
  – What if the manager crashes while changing the system?


[Example downtime notice from the slide: "Tomorrow Technion servers will be down for maintenance from 5:30am to 6:45am. Virtually Yours, Moshe Barak"]

Page 11: Data-Centric Reconfiguration with Network-Attached Disks

Reconfiguration Option 2: Distributed Agreement

• Servers agree on next configuration
  – Previous solutions not data-centric

• No downtime
• In theory, might never terminate [FLP85]

• In practice, we have partial synchrony so it usually works


Page 12: Data-Centric Reconfiguration with Network-Attached Disks

Reconfiguration Option 3: DynaStore [Aguilera, Keidar, Malkhi, Shraer, PODC '09]


• Distributed & completely asynchronous

• No downtime

• Always terminates

• Not data-centric

Page 13: Data-Centric Reconfiguration with Network-Attached Disks

In this work: DynaDisk – dynamic data-centric R/W storage


1. First distributed data-centric solution
  – No downtime

2. Tunable reconfiguration method
  – Modular design: coordination is separate from data
  – Allows easily setting/comparing the coordination method
  – Consensus-based vs. asynchronous reconfiguration

3. Many shared objects
  – Running a protocol instance per object is too costly
  – Transferring all state at once might be infeasible
  – Our solution: incremental state transfer

4. Built with an external (weak) location service
  – We formally state the requirements of such a service

Page 14: Data-Centric Reconfiguration with Network-Attached Disks

Location Service

• Used in practice, ignored in theory
• We formalize the weak external service as an oracle (sketched below):
  – oracle.query() returns some "legal" configuration
  – If reconfigurations stop and oracle.query() is invoked infinitely many times, it eventually returns the last system configuration
• The oracle alone is not enough to solve reconfiguration
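A minimal sketch of how such an oracle could be expressed as an interface; the names Oracle, Query and Configuration below are illustrative assumptions, not part of the system's published API.

```go
package main

import "fmt"

// Configuration is a set of storage-node identities (illustrative type).
type Configuration []string

// Oracle captures the weak location service described on this slide:
// Query may return any previously installed ("legal") configuration, but if
// reconfigurations stop and Query is invoked infinitely often, it eventually
// returns the last configuration of the system.
type Oracle interface {
	Query() Configuration
}

// staticOracle is a trivial in-memory stand-in, for illustration only.
type staticOracle struct{ last Configuration }

func (o *staticOracle) Query() Configuration { return o.last }

func main() {
	var oracle Oracle = &staticOracle{last: Configuration{"A", "B", "C"}}
	fmt.Println("bootstrap configuration:", oracle.Query())
}
```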

Page 15: Data-Centric Reconfiguration with Network-Attached Disks

The Coordination Module in DynaDisk

Storage devices in a configuration conf = {+A, +B, +C}

[Figure: storage devices A, B and C each hold replicas of the R/W objects x, y, z together with a "next config" field]

• Distributed R/W objects – updated similarly to ABD
• Distributed "weak snapshot" object (see the sketch after this list)
  – API: update(set of changes) → OK; scan() → set of updates
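The weak snapshot API above might be rendered as follows; this is a hypothetical Go sketch, and the Change type and method signatures are assumptions for illustration, not the paper's definitions.

```go
package main

import "fmt"

// Change is a single membership change proposal, e.g. "+D" or "-C".
type Change string

// WeakSnapshot sketches the per-configuration coordination object:
// Update proposes a set of changes; Scan returns the sets of changes that
// clients may adopt towards the next configuration. The key property is
// that any two non-empty Scan results intersect.
type WeakSnapshot interface {
	Update(changes []Change) error
	Scan() ([][]Change, error)
}

func main() {
	// Concrete implementations (consensus-based and asynchronous) appear
	// on the following slides; here we only show an example proposal.
	proposal := []Change{"+D"}
	fmt.Println("example reconfig proposal:", proposal)
}
```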

Page 16: Data-Centric Reconfiguration with Network-Attached Disks

Coordination with Consensus

[Figure: two clients concurrently call reconfig({–C}) and reconfig({+D}) against storage devices A, B, C; the proposals are fed to consensus, +D is chosen, and +D is written as the "next config" at the devices]

• update: the proposals go through consensus; the chosen update (here +D) is written as the next config at the storage devices
• scan: read & write-back the next config from a majority – every scan returns {+D} or the empty set

Page 17: Data-Centric Reconfiguration with Network-Attached Disks

Weak Snapshot – Weaker than Consensus

• No need to agree on the next configuration, as long as each process has a set of possible next configurations, and all such sets intersect
  – Intersection allows the processes to converge and again use a single config
• Non-empty intersection property of weak snapshot:
  – Every two non-empty sets returned by scan() intersect
  – Example:

      Client 1's scan   Client 2's scan
      {+D}              {+D}              (consensus: both scans agree)
      {–C}              {+D, –C}          (allowed: the sets intersect)
      {+D}              {–C}              (not allowed: no intersection)

Page 18: Data-Centric Reconfiguration with Network-Attached Disks

Coordination without Consensus

[Figure: two clients concurrently call reconfig({–C}) and reconfig({+D}); the {–C} client installs its proposal at the devices with CAS({–C}, 0), CAS({–C}, 1) and WRITE({–C}, 0), alongside the already-installed +D proposals]

• update: install the proposal at the storage devices (via CAS/WRITE on proposal slots, as in the figure)
• scan: read & write-back proposals from a majority (twice) – illustrated below
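The following is only a toy, sequential illustration of the scan pattern named on this slide (collect proposals from a majority, write them back, then collect again). It omits failures, concurrency and the CAS-based update path shown in the figure, all of which matter for the real protocol's guarantees; it is not the actual DynaDisk algorithm.

```go
package main

import "fmt"

// node holds, for one configuration, the set of change-proposals it has seen.
type node struct{ proposals map[string]bool }

// collect gathers the proposals stored at a majority of the nodes.
// (Toy version: sequential and failure-free.)
func collect(cfg []*node) map[string]bool {
	seen := map[string]bool{}
	majority := len(cfg)/2 + 1
	for _, n := range cfg[:majority] {
		for p := range n.proposals {
			seen[p] = true
		}
	}
	return seen
}

// writeBack stores every collected proposal at a majority of the nodes.
func writeBack(cfg []*node, seen map[string]bool) {
	majority := len(cfg)/2 + 1
	for _, n := range cfg[:majority] {
		for p := range seen {
			n.proposals[p] = true
		}
	}
}

// scan sketches the pattern on the slide: read proposals from a majority,
// write them back, and read again before returning.
func scan(cfg []*node) map[string]bool {
	writeBack(cfg, collect(cfg))
	return collect(cfg)
}

func main() {
	cfg := []*node{
		{proposals: map[string]bool{"-C": true}},
		{proposals: map[string]bool{"+D": true}},
		{proposals: map[string]bool{}},
	}
	fmt.Println("scan result:", scan(cfg)) // both proposals are observed
}
```

In the real system these collects are remote reads and writes to network-attached disks, issued in parallel and tolerating a minority of unresponsive devices.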

Page 19: Data-Centric Reconfiguration with Network-Attached Disks

Tracking Evolving Configurations

• With consensus: agree on the next configuration
• Without consensus – usually a chain, sometimes a DAG (a small worked example follows the figure):

[Figure: starting from configuration {A, B, C}, a weak snapshot scan() that returns {+D} leads to {A, B, C, D}, while a scan() that returns {+D, –C} leads to {A, B, D}; because all non-empty scans intersect, the inconsistent updates are found and merged, and the configurations converge to {A, B, D}]
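As a small worked example of how a client turns a scanned set of changes into a candidate next configuration (my own sketch, not the paper's code):

```go
package main

import (
	"fmt"
	"sort"
)

// apply computes the configuration obtained by applying a set of membership
// changes (e.g. "+D", "-C") to the current configuration. In the example on
// this slide, {+D} turns {A, B, C} into {A, B, C, D}, while the merged set
// {+D, -C} turns it into {A, B, D}.
func apply(current []string, changes []string) []string {
	members := map[string]bool{}
	for _, m := range current {
		members[m] = true
	}
	for _, c := range changes {
		switch c[0] {
		case '+':
			members[c[1:]] = true
		case '-':
			delete(members, c[1:])
		}
	}
	next := []string{}
	for m := range members {
		next = append(next, m)
	}
	sort.Strings(next)
	return next
}

func main() {
	current := []string{"A", "B", "C"}
	fmt.Println(apply(current, []string{"+D"}))       // [A B C D]
	fmt.Println(apply(current, []string{"+D", "-C"})) // [A B D]
}
```

Different clients may apply different (but intersecting) change sets; the intersection property is what lets them later discover and merge each other's updates, as the DAG above illustrates.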

Page 20: Data-Centric Reconfiguration with Network-Attached Disks

Consensus-based vs. Asynchronous Coordination

• Two implementations of weak snapshots
  – Asynchronous
  – Partially synchronous (consensus-based)
      Active Disk Paxos [Chockler, Malkhi, 2005]
      Exponential backoff for leader election

• Unlike asynchronous coordination, consensus-based might not terminate [FLP85]

• Storage overhead (per storage device and configuration)
  – Asynchronous: a vector of updates, with vector size ≤ min(#reconfigs, #members in config)
  – Consensus-based: 4 integers and the chosen update


Page 21: Data-Centric Reconfiguration with Network-Attached Disks

Strong progress guarantees are not for free

[Charts: average write latency and average reconfig latency (in ms) as a function of the number of simultaneous reconfig operations (0, 1, 2, 50), comparing consensus-based and asynchronous (no consensus) coordination]

• Asynchronous coordination has a significant negative effect on R/W latency while reconfigurations execute
• Its reconfig latency is slightly better, and much more predictable, when many reconfigs execute simultaneously
• The two variants perform the same when no reconfigurations are running

Page 22: Data-Centric Reconfiguration with Network-Attached Disks

Future & Ongoing Work

• Combine asynchronous and partially synchronous coordination

• Consider other weak snapshot implementations
  – E.g., using randomized consensus

• Use weak snapshots to reconfigure other services
  – Not just for R/W


Page 23: Data-Centric Reconfiguration with Network-Attached Disks

Summary

• DynaDisk – dynamic data-centric R/W storage
  – First decentralized solution
  – No downtime
  – Supports many objects, provides incremental reconfig
  – Uses one coordination object per config (not per object)
  – Tunable reconfiguration method
• We implemented asynchronous and consensus-based coordination
• Many other implementations of weak snapshots are possible
• Asynchronous coordination in practice:
  – Works in more circumstances → more robust
  – But at a cost: significantly affects ongoing R/W ops
