chain replication in theory and in practice - snookles.com filechain replication in theory and in...

22
Chain Replication in Theory and in Practice Scott Lystig Fritchie [email protected] Gemini Mobile Technologies September 30, 2010 ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 1/22

Upload: others

Post on 19-Oct-2019

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Chain Replication in Theory and in Practice

Scott Lystig [email protected]

Gemini Mobile Technologies

September 30, 2010

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 1/22

Page 2: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Overview

• Introduction to Chain Replication

• Overview of Hibari’s Implementation

• 30 seconds on the OSI Systems Fault Model

• OSI FCAPS: Fault management

• OSI FCAPS: Performance management

• Erlang-Specific Issues

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 2/22

Page 3: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Chain ReplicationPapers

• “Chain Replication for Supporting High Throughput andAvailability” by Robbert van Renesse and Fred B. Schneider,USENIX OSDI 2004.

• “Object Storage on CRAQ: High-throughput chain replicationfor read-mostly workloads” by Jeff Terrace and MichaelJ. Freedman, USENIX Tech, 2009

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 3/22

Page 4: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Chain ReplicationOverview

• A variation of master/slave replication

• State machine replication

• In contrast, quorum replication is more popular in open source

• Dynomite, Riak, Cassandra

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 4/22

Page 5: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Chain Replication Messaging FlowsStrong Consistency: Read the Last Write

Client 1 Head ServerUpdate request

Client 2Tail Server Read request

Middle 1 Server

Replication

Middle 2 Server

Replication

Replication

Update reply

Read reply

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 5/22

Page 6: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Hibari Overview

• Top layer: consistent hashing• Collections of billions of keys

• Middle layer: chain replication• Replicate a single key• Single collection of 0 – 50 million keys

• Bottom layer: storage brick• Store a single key• Single collection of 0 – 50 million keys

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 6/22

Page 7: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Hibari Logical Architecture: View 1

Chain 0 Chain 1 Chain 2 Chain 3 Chain 4

Client

Key V

Key W Key X Key YKey Z

Bottom Layer: The Storage Brick 10 logical bricks 5 physical bricks

MiddleLayer:ChainReplication

Top Layer:ConsistentHashing

Head3 Head4

Tail4Tail3

Head2

Tail2

Head1

Tail1

Head0

Tail0

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 7/22

Page 8: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Hibari Logical Architecture: View 2

Head0

PhysicalBrick 1

PhysicalBrick 0

PhysicalBrick 2

PhysicalBrick 3

PhysicalBrick 4

Chain 0

Chain 1

Chain 2

Chain 3

Chain 4

Client Consistent hashing by client: {Table, Key} -> Chain

Tail0

Head1 Tail1

Head2 Tail2

Head3 Tail3

Head4Tail4

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 8/22

Page 9: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

OSI/ISO Systems Fault Model: FCAPS

• Fault• Recognize, isolate, correct, and record faults

• Configuration• Gather, store, and modify device configuration state

• Accounting• Gather and store resource usage and billing data

• Performance• Gather, store, and analyze performance data

• Security• Gather, store, and modify device access control

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 9/22

Page 10: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Performance ManagementResearch Papers vs. Planned Use

• van Renesse and Schneider (2004):• Performance experiments are simulated

• Terrace and Freedman (2009):• Performance measured on Emulab pc3000-type machines• Focused on read-mostly workload:

• Read:write ratio of 50:1 upwards to 150:1 (?)

• Gemini’s planned use• Low-to-mid-range x86 64 servers + RAID disk• Write-heavy workload: read:write ratio of 3:1 or worse

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 10/22

Page 11: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Performance ManagementI/O Latency

• Hard disks are no fun. . . .

• Write latency• Append-only logs to minimize disk head seeks

• Log files are the only data structure

• fsync(2) latency + hard disks = slow

• Hibari default: all updates are durable

• All logical brick logs written to a common log• The common log aggregates fsync(2) requests

• Most Hibari bugs were in write/sync management code.

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 11/22

Page 12: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Performance ManagementI/O latency

• Read latency• Read I/O generators: normal workload, brick repair, key

migration, data “scavenger”• Avoid blocking gen server processes with I/O

• “primer”: spawn proc to pre-read value data• Borrowed from Squid HTTP cache, Flash HTTP server

• Access by lexicographic vs. temporal orders

• Repair and migration: lexicographic order (by key)• Optimal read I/O pattern: temporal order (by update time)• Full repair of 2-3 TByte brick can take several days

• Rate control

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 12/22

Page 13: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Performance ManagementI/O latency

• Other workload factors:• Chain reordering• OS-based read-ahead

Brick A

Brick B

Brick C

Brick A

Brick C

Brick C

Brick B

Brick C

Brick A

Brick B

Brick C Brick A

Brick B

Brick C

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 13/22

Page 14: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Configuration ManagementThe “master”

• van Renesse and Schneider (2004) call for a “master”:

• Detects failures

• Reorders chains

• Informs clients about new chain state

• “In what follows, we assume the master is a single processthat never fails.”

• In prototype, multiple master processes + Paxos replication

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 14/22

Page 15: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Configuration ManagementHibari Admin Server

• Admin Server is a single running entity• Active/standby OTP application• Static configuration: only 2 or 3 machines in cluster

• Monitors health of each logical brick

• Reconfigures chains when brick health changes

• Stores brick health history in Hibari logical bricks• Uses quorum replication• Avoid “chicken and the egg” bootstrap problem

• But . . . OTP app controller is vulnerable to network partition• Segue to the next FCAPS topic. . .

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 15/22

Page 16: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Fault Management

• Detect Admin Server failures

• Detect brick failures

• Detect network partitions

• Detect brick failures during network partitions?

• Repair out-of-sync replicas

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 16/22

Page 17: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Fault Management

• van Renesse and Schneider (2004): “Servers are assumed tobe fail-stop”

• Terrace and Freedman (2009): same

• Fail stop means . . . stop?• Except when network partitions heal

• Kill any brick that makes illegal health state transition

• Except in arbitrary message delays

• net kernel deadlock• busy dist port throttling

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 17/22

Page 18: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Fault ManagementReplica repair/re-sync

• van Renesse and Schneider: add repairing/re-syncing server atend of chain

• Also: Replay all updates in same order

• But . . . non-trivial task with concurrent updates!• Update key K vs. repair K• Manage write(2) and fsync(2) delays on both bricks

• 99th percentile latency of 350+ milliseconds not uncommon

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 18/22

Page 19: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Fault ManagementOther items

• Partition detector app• Use two physical networks to detect partition of one network

• Key timestamps . . . strictly increasing with each update• Clients: enable compare-and-swap atomic operations• Servers: quick way to determine key sync status

• File checksums• Crash brick when corruption is detected

• Replica placement• Very flexible, but no automatic “rack awareness”

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 19/22

Page 20: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Fault/Configuration ManagementAutomatic Key Migration

• Change chain length, i.e. change replication factor

• Add/remove/reweight chains

• Automatically rebalance keys across chains

0.0 0.500 1.0

Chain 150.0%

Chain 250.0%

0.0 0.500 1.00.333 0.833

Chain 133.3%

Chain 233.3%

Chain 316.7%

Chain 316.7%

0.0 0.500 1.00.333 0.8330.7500.250 0.459 0.959

Chain 44.15%

Chain 125.0%

Chain 44.15%

Chain 48.33%

Chain 48.33%

Chain 225.0%

Chain 312.3%

Chain 312.3%

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 20/22

Page 21: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Erlang-Specific Issues

• Messaging reliability• “Send and pray” — Joe Armstrong• . . . too easy to forget when buried in code

• Murphy’s law• A process rate-limited at 40 msgs/sec sends 85,000 messages

in 60 seconds?• Impossible . . . but it really happened

• Be extra-conscious of code with side-effects

• How many nodes can Erlang network distribution support?

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 21/22

Page 22: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September

Thank You!

• ACM anonymous reviewers, Louise Lystig Fritchie, OlafHall-Holt, Satoshi Kinoshita, James Larson, Romain Lenglet,Jay Nelson, Joseph Wayne Norton, Gary Ogasawara, MarkRaugas, Justin Sheehy, Ville Tuulos, and Jordan Wilberding

• http://hibari.sourceforge.net/

• http://www.snookles.com/slf-blog/tag/hibari/

ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 22/22