chain replication in theory and in practice - snookles.com filechain replication in theory and in...
TRANSCRIPT
![Page 1: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/1.jpg)
Chain Replication in Theory and in Practice
Scott Lystig [email protected]
Gemini Mobile Technologies
September 30, 2010
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 1/22
![Page 2: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/2.jpg)
Overview
• Introduction to Chain Replication
• Overview of Hibari’s Implementation
• 30 seconds on the OSI Systems Fault Model
• OSI FCAPS: Fault management
• OSI FCAPS: Performance management
• Erlang-Specific Issues
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 2/22
![Page 3: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/3.jpg)
Chain ReplicationPapers
• “Chain Replication for Supporting High Throughput andAvailability” by Robbert van Renesse and Fred B. Schneider,USENIX OSDI 2004.
• “Object Storage on CRAQ: High-throughput chain replicationfor read-mostly workloads” by Jeff Terrace and MichaelJ. Freedman, USENIX Tech, 2009
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 3/22
![Page 4: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/4.jpg)
Chain ReplicationOverview
• A variation of master/slave replication
• State machine replication
• In contrast, quorum replication is more popular in open source
• Dynomite, Riak, Cassandra
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 4/22
![Page 5: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/5.jpg)
Chain Replication Messaging FlowsStrong Consistency: Read the Last Write
Client 1 Head ServerUpdate request
Client 2Tail Server Read request
Middle 1 Server
Replication
Middle 2 Server
Replication
Replication
Update reply
Read reply
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 5/22
![Page 6: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/6.jpg)
Hibari Overview
• Top layer: consistent hashing• Collections of billions of keys
• Middle layer: chain replication• Replicate a single key• Single collection of 0 – 50 million keys
• Bottom layer: storage brick• Store a single key• Single collection of 0 – 50 million keys
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 6/22
![Page 7: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/7.jpg)
Hibari Logical Architecture: View 1
Chain 0 Chain 1 Chain 2 Chain 3 Chain 4
Client
Key V
Key W Key X Key YKey Z
Bottom Layer: The Storage Brick 10 logical bricks 5 physical bricks
MiddleLayer:ChainReplication
Top Layer:ConsistentHashing
Head3 Head4
Tail4Tail3
Head2
Tail2
Head1
Tail1
Head0
Tail0
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 7/22
![Page 8: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/8.jpg)
Hibari Logical Architecture: View 2
Head0
PhysicalBrick 1
PhysicalBrick 0
PhysicalBrick 2
PhysicalBrick 3
PhysicalBrick 4
Chain 0
Chain 1
Chain 2
Chain 3
Chain 4
Client Consistent hashing by client: {Table, Key} -> Chain
Tail0
Head1 Tail1
Head2 Tail2
Head3 Tail3
Head4Tail4
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 8/22
![Page 9: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/9.jpg)
OSI/ISO Systems Fault Model: FCAPS
• Fault• Recognize, isolate, correct, and record faults
• Configuration• Gather, store, and modify device configuration state
• Accounting• Gather and store resource usage and billing data
• Performance• Gather, store, and analyze performance data
• Security• Gather, store, and modify device access control
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 9/22
![Page 10: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/10.jpg)
Performance ManagementResearch Papers vs. Planned Use
• van Renesse and Schneider (2004):• Performance experiments are simulated
• Terrace and Freedman (2009):• Performance measured on Emulab pc3000-type machines• Focused on read-mostly workload:
• Read:write ratio of 50:1 upwards to 150:1 (?)
• Gemini’s planned use• Low-to-mid-range x86 64 servers + RAID disk• Write-heavy workload: read:write ratio of 3:1 or worse
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 10/22
![Page 11: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/11.jpg)
Performance ManagementI/O Latency
• Hard disks are no fun. . . .
• Write latency• Append-only logs to minimize disk head seeks
• Log files are the only data structure
• fsync(2) latency + hard disks = slow
• Hibari default: all updates are durable
• All logical brick logs written to a common log• The common log aggregates fsync(2) requests
• Most Hibari bugs were in write/sync management code.
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 11/22
![Page 12: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/12.jpg)
Performance ManagementI/O latency
• Read latency• Read I/O generators: normal workload, brick repair, key
migration, data “scavenger”• Avoid blocking gen server processes with I/O
• “primer”: spawn proc to pre-read value data• Borrowed from Squid HTTP cache, Flash HTTP server
• Access by lexicographic vs. temporal orders
• Repair and migration: lexicographic order (by key)• Optimal read I/O pattern: temporal order (by update time)• Full repair of 2-3 TByte brick can take several days
• Rate control
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 12/22
![Page 13: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/13.jpg)
Performance ManagementI/O latency
• Other workload factors:• Chain reordering• OS-based read-ahead
Brick A
Brick B
Brick C
Brick A
Brick C
Brick C
Brick B
Brick C
Brick A
Brick B
Brick C Brick A
Brick B
Brick C
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 13/22
![Page 14: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/14.jpg)
Configuration ManagementThe “master”
• van Renesse and Schneider (2004) call for a “master”:
• Detects failures
• Reorders chains
• Informs clients about new chain state
• “In what follows, we assume the master is a single processthat never fails.”
• In prototype, multiple master processes + Paxos replication
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 14/22
![Page 15: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/15.jpg)
Configuration ManagementHibari Admin Server
• Admin Server is a single running entity• Active/standby OTP application• Static configuration: only 2 or 3 machines in cluster
• Monitors health of each logical brick
• Reconfigures chains when brick health changes
• Stores brick health history in Hibari logical bricks• Uses quorum replication• Avoid “chicken and the egg” bootstrap problem
• But . . . OTP app controller is vulnerable to network partition• Segue to the next FCAPS topic. . .
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 15/22
![Page 16: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/16.jpg)
Fault Management
• Detect Admin Server failures
• Detect brick failures
• Detect network partitions
• Detect brick failures during network partitions?
• Repair out-of-sync replicas
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 16/22
![Page 17: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/17.jpg)
Fault Management
• van Renesse and Schneider (2004): “Servers are assumed tobe fail-stop”
• Terrace and Freedman (2009): same
• Fail stop means . . . stop?• Except when network partitions heal
• Kill any brick that makes illegal health state transition
• Except in arbitrary message delays
• net kernel deadlock• busy dist port throttling
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 17/22
![Page 18: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/18.jpg)
Fault ManagementReplica repair/re-sync
• van Renesse and Schneider: add repairing/re-syncing server atend of chain
• Also: Replay all updates in same order
• But . . . non-trivial task with concurrent updates!• Update key K vs. repair K• Manage write(2) and fsync(2) delays on both bricks
• 99th percentile latency of 350+ milliseconds not uncommon
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 18/22
![Page 19: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/19.jpg)
Fault ManagementOther items
• Partition detector app• Use two physical networks to detect partition of one network
• Key timestamps . . . strictly increasing with each update• Clients: enable compare-and-swap atomic operations• Servers: quick way to determine key sync status
• File checksums• Crash brick when corruption is detected
• Replica placement• Very flexible, but no automatic “rack awareness”
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 19/22
![Page 20: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/20.jpg)
Fault/Configuration ManagementAutomatic Key Migration
• Change chain length, i.e. change replication factor
• Add/remove/reweight chains
• Automatically rebalance keys across chains
0.0 0.500 1.0
Chain 150.0%
Chain 250.0%
0.0 0.500 1.00.333 0.833
Chain 133.3%
Chain 233.3%
Chain 316.7%
Chain 316.7%
0.0 0.500 1.00.333 0.8330.7500.250 0.459 0.959
Chain 44.15%
Chain 125.0%
Chain 44.15%
Chain 48.33%
Chain 48.33%
Chain 225.0%
Chain 312.3%
Chain 312.3%
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 20/22
![Page 21: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/21.jpg)
Erlang-Specific Issues
• Messaging reliability• “Send and pray” — Joe Armstrong• . . . too easy to forget when buried in code
• Murphy’s law• A process rate-limited at 40 msgs/sec sends 85,000 messages
in 60 seconds?• Impossible . . . but it really happened
• Be extra-conscious of code with side-effects
• How many nodes can Erlang network distribution support?
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 21/22
![Page 22: Chain Replication in Theory and in Practice - snookles.com fileChain Replication in Theory and in Practice Scott Lystig Fritchie slfritchie@snookles.com GeminiMobileTechnologies September](https://reader031.vdocuments.mx/reader031/viewer/2022021916/5e17111ad3561560f85126d4/html5/thumbnails/22.jpg)
Thank You!
• ACM anonymous reviewers, Louise Lystig Fritchie, OlafHall-Holt, Satoshi Kinoshita, James Larson, Romain Lenglet,Jay Nelson, Joseph Wayne Norton, Gary Ogasawara, MarkRaugas, Justin Sheehy, Ville Tuulos, and Jordan Wilberding
• http://hibari.sourceforge.net/
• http://www.snookles.com/slf-blog/tag/hibari/
ACM Erlang Workshop 2010: Chain Replication in Theory and in Practice 22/22