Distributed Systems CS6421 Storage: Scale and Fault Tolerance Prof. Tim Wood


Page 1

Distributed Systems CS6421

Storage: Scale and Fault Tolerance

Prof. Tim Wood

Page 2

Our picture of the cloud…
Data centers connected around the world…

[Figure: a data center]

Page 3

Our picture of the cloud…
Filled with lots of servers…

[Figure: a data center with racks of servers]

Page 4

Our picture of the cloud…
Connected with routers, switches, and middleboxes

[Figure: data center network with an IDS monitor and an SDN controller]

Page 5

Our picture of the cloud…
Each hosting VMs or containers

[Figure: one host runs a hypervisor with Linux and Windows VMs, each running apps; another host runs apps in containers on a host OS (Linux)]

Page 6

Resources?
CPU, RAM, and Disks

- How is storage different?

[Figure: hardware running a hypervisor with Linux and Windows VMs hosting apps]

Page 7

Numbers you should know…
How long to…

L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns
Compress 1K bytes with Zippy ............. 3,000 ns = 3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs
SSD random read ........................ 150,000 ns = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs
Round trip within same datacenter ...... 500,000 ns = 0.5 ms
Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms
Disk seek ........................... 10,000,000 ns = 10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns = 20 ms
Send packet CA -> Netherlands -> CA . 150,000,000 ns = 150 ms


Page 8

Numbers you should know…


also: https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Page 9

Scale it up…
After multiplying everything by a billion…

L1 cache reference                     0.5 s        One heartbeat (0.5 s)
L2 cache reference                     7 s          Long yawn
Main memory reference                  100 s        Brushing your teeth
Send 2K bytes over 1 Gbps network      5.5 hr       From lunch to end of work day
SSD random read                        1.7 days     A normal weekend
Read 1 MB sequentially from memory     2.9 days     A long weekend
Round trip within same datacenter      5.8 days     A medium vacation
Read 1 MB sequentially from SSD        11.6 days    Waiting 2 weeks for a delivery
Disk seek                              16.5 weeks   A semester in university
Read 1 MB sequentially from disk       7.8 months   Almost producing a new human being
The above 2 together                   1 year
Send packet CA -> Netherlands -> CA    4.8 years    Going to university for a BS degree


Page 10

Accessing 1 MB of data
(times in parentheses are actual; the analogies use the billion-times scaling from the previous slide)

Data in RAM:
- 3 days (250 microseconds) - 60 GBps
Data in SSD:
- 2 weeks (1 millisecond) - 1 GBps
Data in HDD:
- a year (20 milliseconds) - 100 MBps
Send 1 MB over a 10 Gbps network:
- length of this class (2 milliseconds) - 1.25 GBps
(a quick cross-check of these figures follows below)
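As a rough cross-check (an added sketch, not from the slides), each per-MB time above can be inverted to get the effective bandwidth it implies; the latency list is rounded and approximate, so the implied figures only line up with the quoted bandwidths to within a small factor:

```python
MB = 1_000_000  # bytes

per_mb_time = {                        # per-1 MB times quoted above (seconds)
    "RAM":             250e-6,        # 250 microseconds
    "SSD":               1e-3,        # 1 millisecond
    "HDD":              20e-3,        # 20 milliseconds
    "10 Gbps network": MB * 8 / 10e9, # ~0.8 milliseconds
}

for medium, t in per_mb_time.items():
    implied_bw = (MB / t) / 1e6        # effective MB per second
    print(f"{medium:>16}: ~{implied_bw:,.0f} MB/s effective")
```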


Page 11

Networked Storage
The added cost of using the network is relatively low

What are the benefits of using remote storage instead of local?


Page 12

Storage Services
Block storage (EBS)
- Access to the raw bytes of a remote disk
- Unit of access: a disk block (4 KB)
- Mount as the disk for a VM
- Pros/Cons?
  - Different types of underlying storage (SSD vs HDD)
  - Pay per GB, but the space must be reserved in advance; number of IOPS?
  - Limited to 16 TB per disk


Page 13

Storage Services
Block storage (EBS)
Object storage (S3, DynamoDB)
- Get and Put "objects" in remote storage
- Unit of access: a full object
- Store static web content, data sets
- Pros/Cons?
  - Pay per object based on size, and also per request; network bandwidth?
  - Limit of 5 TB per object
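As an illustration of the get/put model (my sketch, not from the slides), this is roughly what object access looks like with boto3 against S3; the bucket and key names are made up, and it assumes AWS credentials and the bucket already exist:

```python
import boto3

s3 = boto3.client("s3")

# Put: the unit of access is the whole object.
s3.put_object(Bucket="cs6421-example-bucket", Key="datasets/words.txt",
              Body=b"hello object storage\n")

# Get: read the whole object back (there are no in-place partial updates).
resp = s3.get_object(Bucket="cs6421-example-bucket", Key="datasets/words.txt")
print(resp["Body"].read())
```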


Page 14

Storage Services
Block storage (EBS)
Object storage (S3, DynamoDB)
Database (RDS)
- Store structured rows and columns of data
- Unit of access: a SQL query
- Web applications requiring transaction support
- Pros/Cons?
  - AWS handles management


Page 15

Implementation?
How would you build a block/object store?
- What abstraction layers?
- What should the interface look like?
- What traits do you optimize for?

What price could you sell it for?
(one possible interface sketch follows below)
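One possible answer to the interface question, as a sketch rather than the course's design: a fixed-block read/write API, with a local file standing in for the remote volume.

```python
BLOCK_SIZE = 4096   # 4 KB, matching the unit of access above

class BlockStore:
    """Toy fixed-block volume; a local file stands in for the remote disk."""

    def __init__(self, path, num_blocks):
        self.path, self.num_blocks = path, num_blocks
        with open(path, "wb") as f:
            f.truncate(num_blocks * BLOCK_SIZE)   # pre-allocate the volume

    def read_block(self, block_no):
        with open(self.path, "rb") as f:
            f.seek(block_no * BLOCK_SIZE)
            return f.read(BLOCK_SIZE)

    def write_block(self, block_no, data):
        assert len(data) == BLOCK_SIZE, "writes are whole blocks"
        with open(self.path, "r+b") as f:
            f.seek(block_no * BLOCK_SIZE)
            f.write(data)

store = BlockStore("volume.img", num_blocks=256)  # a 1 MB toy volume
store.write_block(3, b"x" * BLOCK_SIZE)
print(store.read_block(3)[:4])                    # b'xxxx'
```

A real service would add caching, replication, and access control behind the same narrow read/write-block interface.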


Page 16

Our Block Store

[Figure: hosts running hypervisors with multiple VMs; each VM's virtual disk is served through a per-VM cache from partitions (e.g. "TimPart") on shared SSD/HDD storage; a manager handles "request disk"; block size: 4 KB]
Page 17

Replication Improves
Performance
- Access the least loaded or geographically closest replica
Reliability
- A failure only impacts one copy; we can recover from the others


Page 18

How many replicas?
What failures can I handle?



Page 22

Types of Faults
Crash failure
- power outage, hardware crash
Content failure
- incorrect value returned
- could be permanent or transient
- could be independent or coordinated

In practice, we make assumptions about how many failures we expect to occur at once:
- "At most 1 node will crash at a time"
- "At most 3 nodes will be corrupted at the same time"


Page 23

Fault Tolerance through Replication
How to tolerate a crash failure?

- A server suddenly disappears


Page 24

Fault Tolerance through Replication
How to tolerate a crash failure?

- A server suddenly disappears

[Figure: replicas P1 and P2 receive the same inputs and both compute 2+2 = 4; one replica crashes, but the output 4 is still produced]

If we expect at most f faulty nodes, how many servers do we need total?

Page 25

Fault Tolerance through Replication
How to tolerate a content failure?

- A server starts producing an incorrect result

[Figure: replicas P1, P2, P3 receive the same inputs but return conflicting values (4 and 5); output = ??? (annotated: f=2, need f+2)]

Page 26

Fault Tolerance through Replication
How to tolerate a content failure?

- A server starts producing an incorrect result

Will the same solution work?

[Figure: f+1 replicas; P1 computes 2+2 = 4 but P2 computes 2+2 = 7; output = ???]

Page 27

Fault Tolerance through Replication
How to tolerate a content failure?

[Figure: 2f+1 replicas P1, P2, P3 compute 2+2 and return 4, 4, 5; a majority voter produces output = 4]
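A minimal sketch of the majority voter in the figure above (assuming 2f+1 independent replica responses with at most f of them wrong):

```python
from collections import Counter

def majority_vote(responses):
    value, count = Counter(responses).most_common(1)[0]
    if count > len(responses) // 2:     # strict majority of the 2f+1 replies
        return value
    raise RuntimeError("no majority: more faults than the assumed f")

print(majority_vote([4, 4, 5]))         # the faulty "5" is outvoted -> 4
```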

Page 28

Fault Tolerance through Replication
How to tolerate a crash failure?

How to tolerate a content failure?

[Figure: a crash failure is handled with f+1 replicas (one replica crashes, the other still outputs 2+2 = 4); a content failure is handled with 2f+1 replicas and a majority voter (4, 4, 5 votes to output = 4)]

Page 29

Detection is Hard
Or maybe even impossible

How long should we set a timeout?

How do we know heartbeat messages will go through?

What if the voter is wrong?


Page 30

Agreement without Voters
We can't always assume there is a perfectly correct voter to validate the answers

Better: have the replicas reach agreement amongst themselves about what to do
- Exchange calculated values and have each node pick the winner

[Figure: replicas A, B, C broadcast their computed values (4, 4, 5) to each other]

Replica   Receives    Action
A         4, 4, 5     = 4
B         4, 4, 5     = 4
C         4, 4, 5     = 4?
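A small sketch of the exchange in this figure: every replica broadcasts its value, and each one independently takes the majority of what it received. Note this only works because the faulty replica reports the same (wrong) value to everyone, which sets up the problem on the next slides:

```python
from collections import Counter

computed = {"A": 4, "B": 4, "C": 5}            # C has a content fault

# Every replica broadcasts its value; everyone receives the same multiset.
received = list(computed.values())             # [4, 4, 5]

for replica in computed:
    decision = Counter(received).most_common(1)[0][0]
    print(replica, "decides", decision)        # A, B, C all decide 4
```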

Page 31

Reaching Agreement
3 armies gather to attack a city
- But one is led by a traitor!
- We have 2f+1 replicas

The assault will only succeed if at least 2 armies attack at the same time
- I think we should… 1 = attack, 0 = retreat!

[Figure: armies A, B, C exchange their votes]

Replica   Receives    Action
A         1, 0, 1     Attack!
B         1, 0, 1     Attack!
C         1, 0, 1     ???

Page 32

Reaching Agreement
3 armies gather to attack a city
- But one is led by a traitor!
- We have 2f+1 replicas

The assault will only succeed if at least 2 armies attack at the same time
- I think we should… 1 = attack, 0 = retreat!

[Figure: armies A, B, C exchange their votes]

Replica   Receives    Action
A         1, 1, 0     Attack!
B         1, 1, 0     Attack!
C         1, 1, 0     ???

Page 33

Reaching Agreement
3 armies gather to attack a city
- But one is led by a traitor!
- We have 2f+1 replicas

The assault will only succeed if at least 2 armies attack at the same time
- I think we should… 1 = attack, 0 = retreat!

[Figure: armies A, B, C exchange their votes]

Replica   Receives    Action
A         1, 0, 0     Retreat!
B         1, 0, 0     Retreat!
C         1, 0, 1     ???

Page 34

Byzantine Generals Problem
3 armies gather to attack a city
- But one is led by a traitor!
- We have 2f+1 replicas

The assault will only succeed if at least 2 armies attack at the same time
- I think we should… 1 = attack, 0 = retreat!

[Figure: armies A, B, C exchange their votes, but the traitor tells different armies different things]

Replica   Receives    Action
A         1, 0, 1     Attack!
B         1, 0, 0     Retreat!
C         1, 0, 1     ???

Majority voting doesn't work if a replica lies!

Page 35

Byzantine Generals Solved*!
Need more replicas to reach consensus
- Requires 3f+1 replicas to tolerate f Byzantine faults

Step 1: Send your plan to everyone
Step 2: Send learned plans to everyone
Step 3: Detect conflicts and use majority

[Figure: generals A, B, C, D exchange plans x, y, z]

Replica   Receives                                                 Majority
A         A: (1,0,1,1)  B: (1,0,0,1)  C: (1,1,1,1)  D: (1,0,1,1)   A: 1, B: 1, C: 1, D: 1
B         A: (1,0,1,1)  B: (1,0,0,1)  C: (0,0,0,0)  D: (1,0,0,1)   A: 1, B: 1, C: 0, D: 1
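A toy sketch of the three steps for f = 1 with 3f+1 = 4 generals; the relay rule and the traitor's behavior here are simplified illustrations, not the full Lamport/Shostak/Pease algorithm:

```python
from collections import Counter

generals = ["A", "B", "C", "D"]
plans = {"A": 1, "B": 0, "C": 1, "D": 1}      # C is the traitor

# Step 1: each general sends its plan to everyone.
# Step 2: each general relays the plans it heard; the traitor C relays lies
#         and may say different things to different receivers.
def relayed(sender, relayer, receiver):
    if relayer == "C":
        return 0 if receiver in ("A", "B") else 1
    return plans[sender]                      # honest relayers are faithful

# Step 3: for each sender, take the majority over all relayed reports.
def decide(receiver):
    return {s: Counter(relayed(s, r, receiver) for r in generals).most_common(1)[0][0]
            for s in generals}

for g in ("A", "B", "D"):                     # the three loyal generals
    print(g, decide(g))                       # all loyal generals derive the same table
```

With 3f+1 replicas, each per-sender majority contains enough honest reports to outvote the traitor, so the loyal generals end up with the same table and act consistently.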

Page 36

Quorum-Based Systems
Quorum: a set of responses, of a particular size, that agree with each other

Crash fault tolerance: need a quorum of 1
- f others can fail (thus need f+1 total replicas)

Data fault tolerance: need a quorum of f+1
- f others can fail (thus need 2f+1 total replicas)
- Need a majority to determine correctness
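In code form, a sketch of the replica-count rule on this slide, plus the 3f+1 Byzantine case from the previous slides:

```python
def replicas_needed(f, fault_type):
    """Total replicas to tolerate f simultaneous faults of the given type."""
    if fault_type == "crash":        # any single correct response suffices
        return f + 1
    if fault_type == "content":      # need a majority of f+1 matching answers
        return 2 * f + 1
    if fault_type == "byzantine":    # agreement among the replicas themselves
        return 3 * f + 1
    raise ValueError(fault_type)

for f in (1, 2, 3):
    print(f, [replicas_needed(f, t) for t in ("crash", "content", "byzantine")])
```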


Page 37

Quorum
4 replicas
- Some nodes might be temporarily offline

How many replicas should a read or write be sent to?
- Must wait for a response from each one

[Figure: four replicas holding (value, version) pairs (5,2), (7,3), (5,2), (3,1); a client issues Write: 7 and a Read]
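A sketch of how a reader could resolve the responses in this figure, assuming each replica stores a (value, version) pair and the write of 7 bumped the version to 3 on the replicas it reached; any read quorum that overlaps the write quorum sees the newest version:

```python
responses = [(5, 2), (7, 3), (5, 2), (3, 1)]   # (value, version) held by each replica

def read_resolve(quorum):
    # Return the value carried by the highest version number in the quorum.
    value, version = max(quorum, key=lambda vv: vv[1])
    return value

print(read_resolve([responses[0], responses[1]]))  # overlaps the write -> 7
print(read_resolve([responses[0], responses[3]]))  # misses the write  -> stale 5
```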

Page 38

Dynamo DB
Object store from Amazon
- Technical paper at SOSP 2007 (a top OS conference)

Stores N replicas of all objects
- But a replica could be out of date!
- Might be saved across multiple data centers
- Gradually pushes updates to all replicas to keep them in sync

When you read, how many copies, R, should you read from before accepting a response?

When you write, how many copies, W, should you write to before confirming the write?


Page 39

Dynamo DB
Read and write quorum sizes:
- R = 1: fastest read performance, no consistency guarantees
- W = 1: fast writes, reads may not be consistent
- R = N/2 + 1: reading from a majority
- R = 1, W = N: slow writes, but reads are consistent
- R = N, W = 1: slow reads, fast writes, consistent
- Standard: N = 3, R = 2, W = 2
  - if f = 1,
(a sketch of the overlap rule behind these configurations follows below)
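A small sketch of the Dynamo-style tuning rule, assuming the usual condition that a read is guaranteed to see the last write exactly when the read and write quorums overlap, i.e. R + W > N:

```python
def quorum_properties(n, r, w):
    return {
        "consistent": r + w > n,   # every read quorum overlaps the last write quorum
        "read_cost": r,            # replicas that must answer a read
        "write_cost": w,           # replicas that must ack a write
    }

for n, r, w in [(3, 1, 1), (3, 2, 2), (3, 1, 3), (3, 3, 1)]:
    print((n, r, w), quorum_properties(n, r, w))
```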


Page 40

Quorum
How do N, R, and W affect:
- Performance:
- Consistency:
- Durability:
- Availability:

DynamoDB lets the user tune these for their needs


Page 41

Quorum
How do N, R, and W affect:

Performance:
- low R or W -> higher performance
- for a fixed R or W, higher N gives higher performance
- higher N means more synchronization traffic

Consistency:
- R + W > N guarantees consistency
- R + W << N: much less likely to be consistent

Durability:
- N=1 vs N=100: more replicas (higher N) = more durability

Availability:
- Higher N or W => higher availability
