building reliable systems with apache bookkeeper
DESCRIPTION
A presentation at the Barcelona JUG 19/06/2014TRANSCRIPT
![Page 1: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/1.jpg)
Building reliable systems with
Apache BookKeeper
Matthieu MorelIvan Kelly
![Page 2: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/2.jpg)
Challenges in distributed systems
![Page 3: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/3.jpg)
Ryan Lintleman cc-by-nc 2.0 https://flic.kr/p/5XNGow
Expect failures
![Page 4: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/4.jpg)
up to 10% annual failure rates for disks/servers
![Page 5: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/5.jpg)
The network is reliable
NOT
Jonathan Briggs - cc by 2.0 https://flic.kr/p/bnAxuz
![Page 6: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/6.jpg)
SymptomsAlex Proimos cc-by-2.0 https://flic.kr/p/bt29wL
![Page 7: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/7.jpg)
Problem 1: not available
![Page 8: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/8.jpg)
Problem 1: not available
![Page 9: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/9.jpg)
Problem 2: inconsistencies
![Page 10: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/10.jpg)
CAP
consistency
partition toleranceavailability
zookeeper / bookkeeper
cassandra
![Page 11: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/11.jpg)
More issues...
cc-by-2.0 https://flic.kr/p/8j57SG
![Page 12: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/12.jpg)
Problem 3: split brain
writer A writer A writer A
writer A’
2 writers !
writer A’
single writer for A
![Page 13: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/13.jpg)
Problem 4: failure detection
A
B
C
![Page 14: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/14.jpg)
Problem 5: recovery
Recovery protocol?
For many systems,we need consistent datafor recovery
![Page 15: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/15.jpg)
Solutions !
Steven Depolo cc-by-2.0 https://flic.kr/p/9APgFF
![Page 16: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/16.jpg)
guarantees
protocols
tools / building blocks
techniques
![Page 17: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/17.jpg)
Primary backup, active replication
active standby
active
active
![Page 18: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/18.jpg)
Replication
f+1 replicas for f concurrent failures
![Page 19: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/19.jpg)
Quorums
ensemble
quorum 1
quorum 2 quorum 3
![Page 20: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/20.jpg)
Useful building block: ZooKeeper
● Centralized coordination serviceo configuration, service discoveryo locking, queues, barrierso failure detectiono leader election, membership
● Reliable
● Source of truth
![Page 21: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/21.jpg)
Failure detection with ZooKeeper
ZooKeeper
Ephemeral znodes
Heartbeats
Timeouts
Triggers cluster update
![Page 22: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/22.jpg)
Recovery: protocol
1. provision / acquire new node
2. fetch under-replicated data
3. rebuild state
4. join ensemble
![Page 23: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/23.jpg)
Requirement: durability
sync mutationsto storage
append-onlyjournal
Journaling:persisting mutations
Imaging:persisting the state when it changes
![Page 24: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/24.jpg)
Concretely
● Write-Ahead Logging
● Databases - data stores
● Durable Messaging
![Page 25: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/25.jpg)
Enter Apache BookKeeper
Reliable distributed logging
CC-BY-2.0 https://flic.kr/p/dSHr87
![Page 26: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/26.jpg)
BookKeeper : durability service
durability
replication consistency
on commodity hardware
recovery
user library
A building block for reliable systems
![Page 27: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/27.jpg)
The ledger abstraction
op op op op op op op op op op opop opop opop op
addread
checkpoint
Ledger 1Ledger 2 Ledger 3
![Page 28: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/28.jpg)
Guarantees
If an entry has been acknowledged,
it must be readable
If an entry is read once,
it must always be readable
![Page 29: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/29.jpg)
History
Initial use case : Hadoop name node recovery
2008: open sourced contrib of ZooKeeper
2011: sub-project of ZooKeeper
2012: Production
![Page 30: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/30.jpg)
Community
Committers from:
● Yahoo!● Twitter● Microsoft● Huawei● Facebook
![Page 31: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/31.jpg)
Inside of Apache BookKeeper
CC-BY-2.0 https://flic.kr/p/agpLTR
![Page 32: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/32.jpg)
ledgerçledger
Architecture
bookie
zookeeper
bookie
bookie
client library
client system
ledgerledger entry
index
metadatastore
![Page 33: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/33.jpg)
Write path
ledgerçledger
bookie
bookie
bookie
client library
ledgerledger entry
replication+striping
![Page 34: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/34.jpg)
Reliable writes
● store digest along with entry
● fsync each entry before returning
● ACK when:○ all previous
entries ○ this entry
accepted by quorum
![Page 35: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/35.jpg)
Read path
bookie
bookie
bookie
read(ledger X entry Y)
![Page 36: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/36.jpg)
Partial writes
bookie
bookie
bookie
read(ledger X entry Y)no quorum for this entry!
Read would be inconsistent
![Page 37: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/37.jpg)
Last Add Confirmed
Consensus on written entries
bookie
bookie
bookie
read(ledger X entry Y)
Zookeeper
close(ledger X)
Ledger X, LAC
![Page 38: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/38.jpg)
Recovery of a ledger
bookie
bookie
bookie
zookeeper
What is the last entry?
piggy-backlast add confirmed
![Page 39: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/39.jpg)
Fencing: prevent multiple brains
writer writer
writer writer
![Page 40: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/40.jpg)
roll
Inside of a bookie
L2 - E7L2 - E6
L2 - E3L2 - E4
L1 - E4
L1 - E2L2 - E1
L1 - E1
L3 - E1
...
sequential entries
interleaved
physical file
![Page 41: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/41.jpg)
Storage device
disk
fsync Sequential entriesSynchronous writes
OK for writing
Reads interfere with writes
add ack
L2 - E3L3 - E7
L1 - E4
L1 - E2L2 - E1
L1 - E1
![Page 42: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/42.jpg)
Separate read and write devices
disk 2
fsync
ackadd
L2 - E3L3 - E7
L1 - E4
L1 - E2L2 - E1
L1 - E1disk 1
L2 - E3L3 - E7
L1 - E4L1 - E2
L2 - E1
L1 - E1
async flush
cache Similar ratesDurabilityRead-efficient
INDEX
Ledger device Journal device
![Page 43: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/43.jpg)
![Page 44: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/44.jpg)
Garbage collection / compaction
disk 2
L2 - E3L3 - E7
L1 - E4
L1 - E2L2 - E1
L1 - E1
disk 1
L2 - E3L3 - E7
L2 - E1L1 - E4L1 - E2L1 - E1
L1 - E4L1 - E2
L2 - E1L1 - E1
L1 - E4
L1 - E2L2 - E1
L1 - E1
Ledger 1 deleted
L2 - E1
Entry log Journal
![Page 45: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/45.jpg)
![Page 46: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/46.jpg)
Using Apache BookKeeper
as a building block
Raul Hernandez - CC-BY-2.0 https://flic.kr/p/aSwTKT
![Page 47: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/47.jpg)
Guarantees
If an entry has been acknowledged,
it must be readable
If an entry is read once,
it must always be readable
![Page 48: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/48.jpg)
API
BookKeeper
createLedgeropenLedgerdeleteLedger
LedgerHandle
addEntryreadEntryclose
asyncCreateLedgerasyncOpenLedgerasyncDeleteLedger
asyncAddEntryasyncReadEntryasyncClose
Asynchronouswithcallbacks
![Page 49: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/49.jpg)
Tech stack
● Java● Netty● ZooKeeper
![Page 50: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/50.jpg)
Performance considerations
I/O bound- disk IOPS: ~ 120/s HDD, 500 000/s SSD- network: 1Gb/s ~ 100MB/s max or less in practice
~ 1KB msgs: 100 000/s per node
![Page 51: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/51.jpg)
Public use cases
● Hadoop namenode (Huawei)
● WAL (HubSpot)
● Hedwig (open source)
● PNUTS cross-colo replication (Yahoo)
● Push notifications (Yahoo)
● Cloud messaging (Yahoo)
![Page 52: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/52.jpg)
Primary backup
BookKeeper
active standby
write tail
apply opsbuild backup state
Reads from open ledger
Asks for current Last Add Confirmedfrom bookies
![Page 53: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/53.jpg)
Data store WAL
Bookkeeper library
bookies
datastore
![Page 54: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/54.jpg)
Bookkeeper library
bookies
Data structure
Durability for arbitrary (distributed)data structures!
![Page 55: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/55.jpg)
Elasticity
Bookkeeper library
bookies
![Page 56: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/56.jpg)
Elasticity
Bookkeeper library
bookies
![Page 57: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/57.jpg)
Shared log infrastructure
Bookkeeper library
bookies
Application A Application B
System C
![Page 58: Building reliable systems with Apache BookKeeper](https://reader036.vdocuments.mx/reader036/viewer/2022062300/557d0146d8b42a98158b5236/html5/thumbnails/58.jpg)
http://zookeeper.apache.org/bookkeeper/
https://github.com/apache/bookkeeper
Matthieu Morel: mmorel
Ivan Kelly: ivank
@apache.org