TRANSCRIPT
Anti-Entropy using CRDTs on HA Datastores
Sailesh Mukil, Senior Software Engineer, Netflix
Timeline
NETFLIX
Cassandra adoption
2011 2013 2016
Multi-region Dynomite
Dynomite
Makes non-distributed datastores, distributed
[Diagram: Datastore, divided 33% / 33% / 33%]
Dynomite Overview
[Diagram: Client with Replica 1, Replica 2, Replica 3]
Dynomite overview
● Global replication
● High availability
● Shared nothing
● Auto-sharding
● Linear scale
● Pluggable datastores (Redis primarily)
● Multiple quorum levels
● Supports datastore API
Dynomite footprint @ Netflix
● ~1000 customer facing nodes
● ~1M OPS/s
● Largest cluster holds ~6 TB
The problem
Entropy in the system
[Diagram sequence: three replicas, R-1, R-2, R-3]
● SET K 123: all three replicas store K: 123; the client receives OK
● SET K 456: only one replica stores K: 456; the client receives ERR
● SET K 789: a different replica stores K: 789; the client receives ERR
● GET K: returns 789 or 456, depending on which replica answers
● GET K (w/quorum): the replicas return 123 and 456; the client receives ERR: QUORUM FAILED
Replicas will go out of sync
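The quorum failure in the walk-through can be sketched in a few lines. This is a minimal illustration, not Dynomite's implementation: the `quorum_read` helper and the quorum size of 2 are assumptions made for the example.

```python
from collections import Counter

def quorum_read(responses, quorum=2):
    """Return the value that at least `quorum` replicas agree on, else raise."""
    value, votes = Counter(responses).most_common(1)[0]
    if votes < quorum:
        raise RuntimeError("ERR: QUORUM FAILED")
    return value

# Responses from two replicas that are still in sync.
in_sync = ["123", "123"]
# Responses from two replicas after the failed writes left them diverged.
diverged = ["123", "456"]

print(quorum_read(in_sync))
try:
    quorum_read(diverged)
except RuntimeError as err:
    print(err)
```

Once the replicas disagree, no value reaches the required number of votes, so every quorum read of that key fails until something repairs the replicas.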
Timeline
Cassandra adoption
2011 2013 2016 2019
Multi-region Dynomite
Dynomite w/ CRDTs
Achieving anti-entropy (traditionally)
Last Writer Wins
● Uses physical timestamps
● Clock skew
Vector Clocks
● Shows causal relationships
● But not for concurrent writes
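The vector clock limitation can be seen in a small comparison routine. The sketch below assumes clocks are plain dicts mapping a replica name to a counter; the `compare` function is a hypothetical helper written for this example.

```python
def compare(a, b):
    """Compare two vector clocks given as dicts of replica -> counter."""
    replicas = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # neither precedes the other: no winner

print(compare({"R1": 1}, {"R1": 2}))                     # before
print(compare({"R1": 2, "R2": 0}, {"R1": 1, "R2": 1}))   # concurrent
```

The "concurrent" result is the case the slide refers to: the clocks show that the two writes conflict, but they do not say which one should survive.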
The solution
Conflict free replicated data types
A CRDT is a data structure which can be replicated across the network, where the replicas can be updated independently and concurrently without coordination between the replicas, and where it is always mathematically possible to resolve inconsistencies which might result.
Associative: grouping of operations does not matter
(X + Y) + Z = X + (Y + Z)
Commutative: order of operations does not matter
X + Y = Y + X
Idempotent: duplication of operations does not matter
X + X = X
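A quick way to see the three properties is to check them on a concrete merge function. The sketch below uses set union as the merge, which is one simple operation that satisfies all three; it is an illustrative choice, not something taken from the talk.

```python
from itertools import permutations

def merge(a, b):
    """Merge two grow-only sets by taking their union."""
    return a | b

x, y, z = frozenset({1}), frozenset({2}), frozenset({2, 3})

assert merge(merge(x, y), z) == merge(x, merge(y, z))  # associative
assert merge(x, y) == merge(y, x)                      # commutative
assert merge(x, x) == x                                # idempotent

# Any delivery order, even with a duplicated message, gives the same state.
results = {
    merge(merge(merge(a, b), c), a)
    for a, b, c in permutations([x, y, z])
}
print(len(results))  # 1
```

Because the merge tolerates regrouping, reordering and redelivery, replicas can exchange state in any order and still converge.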
Types of operations on CRDTs
Update
● Updates local state
Merge
● Converges replica states
Introduction to CRDTs
When we write, we update
When we repair, we merge
Read repair = merge on read path
CRDTs provide strong eventual consistency
Naive distributed counter
[Diagram sequence: three replicas, R-1, R-2, R-3]
● INCR CTR: all three replicas store CTR: 1
● DECR CTR and INCR CTR reach different replicas: one replica moves to CTR: 0, another to CTR: 2
● Repair based on timestamp? The latest value is 2, which is incorrect (the net result of the three operations is 1)
CRDT: PNCounters
Each replica maintains 2 “local” counters
● Positive counter: Tracks increments
● Negative counter: Tracks decrements
Final counter value: (Sum of all PCounters - Sum of all NCounters)
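The counter described above can be sketched as follows. The class and method names are illustrative, and the merge rule (take the per-replica maximum of each counter) is the usual state-based PN-counter merge, assumed here because the slide only gives the formula for the final value.

```python
class PNCounter:
    def __init__(self, replica_id, replica_ids):
        self.replica_id = replica_id
        self.p = {r: 0 for r in replica_ids}  # positive counters, one per replica
        self.n = {r: 0 for r in replica_ids}  # negative counters, one per replica

    def incr(self):
        self.p[self.replica_id] += 1          # update: touches only the local slot

    def decr(self):
        self.n[self.replica_id] += 1

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        # Merge: per-slot maximum, so repeated or reordered merges are harmless.
        for r in self.p:
            self.p[r] = max(self.p[r], other.p[r])
            self.n[r] = max(self.n[r], other.n[r])

ids = ["R-1", "R-2", "R-3"]
r1, r2, r3 = (PNCounter(i, ids) for i in ids)

r1.incr()                       # INCR CTR, then replicated everywhere
r2.merge(r1)
r3.merge(r1)
r2.decr()                       # DECR CTR reaches only R-2
r3.incr()                       # INCR CTR reaches only R-3
before = [r.value() for r in (r1, r2, r3)]

for a in (r1, r2, r3):          # repair: merge every replica with every other
    for b in (r1, r2, r3):
        a.merge(b)
after = [r.value() for r in (r1, r2, r3)]
print(before, after)  # [1, 0, 2] [1, 1, 1]
```

Unlike the timestamp-based repair of the naive counter, no increment or decrement is lost: every replica ends at 1.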
CRDT: PNCounter
[Diagram sequence: three replicas, R-1, R-2, R-3; each holds two rows of three counters and a CTR value]
● INCR CTR: the increment is recorded on all three replicas
● DECR CTR and INCR CTR reach different replicas: the replicas now compute CTR = 0, CTR = 1 and CTR = 2
● GET CTR: returns 1 and triggers repair(merge) on the replicas
● After the repair: CTR = 1 on all three replicas
CRDT: LWW-Element Set
Used to maintain key metadata
● Add set: Latest update timestamps for keys
● Remove set: Timestamps at which keys were removed
Registers can take arbitrary values
● Hence we still require LWW to resolve conflicts
Used for registers, hashmaps and sorted sets
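A minimal sketch of the add set / remove set idea follows. It is a simplification under stated assumptions: the value is stored next to the timestamp in the add set for brevity, a remove wins when timestamps tie, and all class and method names are hypothetical.

```python
class LWWElementSet:
    def __init__(self):
        self.add = {}  # key -> (latest update timestamp, value)
        self.rem = {}  # key -> timestamp at which the key was removed

    def set(self, key, value, ts):
        if ts > self.add.get(key, (float("-inf"), None))[0]:
            self.add[key] = (ts, value)

    def delete(self, key, ts):
        if ts > self.rem.get(key, float("-inf")):
            self.rem[key] = ts

    def get(self, key):
        if key not in self.add:
            return None
        ts, value = self.add[key]
        if self.rem.get(key, float("-inf")) >= ts:
            return None          # removed at or after the latest update
        return value

    def merge(self, other):
        # Replaying the other replica's metadata keeps the newest entries.
        for key, (ts, value) in other.add.items():
            self.set(key, value, ts)
        for key, ts in other.rem.items():
            self.delete(key, ts)

r1, r2 = LWWElementSet(), LWWElementSet()
r1.set("K1", "123", 1)
r2.set("K1", "123", 1)
r1.set("K1", "456", 2)      # only r1 sees the update at t2
r2.merge(r1)                # repair
print(r2.get("K1"))         # 456

r1.set("K2", "999", 3)
r2.set("K2", "999", 3)
r1.delete("K2", 4)          # only r1 sees the delete at t4
r2.merge(r1)                # repair
print(r2.get("K2"))         # None
```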
LWW-Element Set
[Diagram sequence: three replicas, R-1, R-2, R-3, each with an add set and a rem set]
● SET K1 123 (t1): every replica records K1 at t1 in its add set and stores K1: 123
● SET K1 456 (t2): only one replica records K1 at t2 and stores K1: 456
● SET K2 999 (t3): two replicas record K2 at t3 and store K2: 999
● GET K1: the replicas return K1 = 456 (t2) and K1 = 123 (t1); t2 > t1 => 456 is the latest value; the client receives “456” and the stale replicas are repaired to K1: 456 at t2
● GET K2: the replicas return (nil) and K2 = 999 (t3); the client receives “999” and the replica missing K2 is repaired
● DEL K2 (t4): K2 is recorded at t4 in the rem set
● GET K2: the replicas return K2 del @t4 and K2 = 999 (t3); t4 > t3, so the client receives (nil) and DEL K2 (t4) is repaired onto the replicas that missed it
Implementation challenges (LWW-element set)
Redis doesn’t maintain timestamps
● Dynomite can track the timestamp of the client request
We’d like Dynomite to remain stateless
● Store the metadata inside Redis
Operations must modify data and metadata atomically
● Rewrite operations into Redis Lua scripts (guarantees atomicity)
Does the remove set grow forever?
● Delete metadata ASAP from remove set if ALL replicas agree
● Background thread cleans rest
● Maintain remove set as sorted set
What does an example Lua script look like?
● Check if update is old
● Discard if it is
● Update data + metadata otherwise
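The three steps of the script can be sketched in Python, with plain dicts standing in for Redis. This is not the actual Lua script from Dynomite; `apply_set`, `store` and `meta` are hypothetical names, and the remove set is left out to keep the sketch short.

```python
def apply_set(store, meta, key, value, ts):
    """Apply a SET only if it is newer than what the metadata records.

    `store` stands in for the Redis keyspace and `meta` for the metadata
    kept alongside it. In the talk both are changed inside one Lua script,
    which is what makes the data and metadata update atomic.
    """
    if ts <= meta.get(key, float("-inf")):
        return False        # the update is old: discard it
    store[key] = value      # otherwise update the data ...
    meta[key] = ts          # ... and the metadata together
    return True

store, meta = {}, {}
print(apply_set(store, meta, "K1", "456", ts=2))  # True
print(apply_set(store, meta, "K1", "123", ts=1))  # False, stale update
print(store["K1"])                                # 456
```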
Repairs occur on read path in Dynomite
● Repairs for point reads only
Background repairs
(Note: Ongoing work)
Repairing on range reads is expensive
● E.g.: Give me all members of a set
● Return everything in this hashmap
● Return me a range from this sorted set
How do we target keys that need repairing?
● Full key walk? (like Cassandra)
● Maintain list of recently written to keys
● Run merge operation on them (async)
● But merge operations on large structures are expensive
Delta-state CRDTs
Maintain list of recent mutations done to keys
Ship only delta-state instead of entire data structure for merge
Confirm which replicas have received it
What is a delta-state?
[Diagram: INCR CTR on R1 raises R1's counter entry to 2; R1 then syncs with R2]
● Full state: R1 ships the entire counter structure to R2
● Delta state: R1 ships only R1 = 2
[Diagram sequence: R1 keeps a Mutations list (𝜹-1 to 𝜹-4) with one column each for R2 and R3; an ack is marked in the list as each ACK arrives from R2 or R3]
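The delta-state idea and the ack bookkeeping can be sketched together. The sketch below uses a grow-only counter to stay short; `DeltaCounter` and its methods are hypothetical names, and the pending list is kept in memory only.

```python
class DeltaCounter:
    """Grow-only counter that records a delta for every local increment."""

    def __init__(self, replica_id, peers):
        self.replica_id = replica_id
        self.peers = set(peers)
        self.state = {}      # replica id -> count (the full state)
        self.pending = []    # (delta, peers that have not acked it yet)

    def incr(self):
        count = self.state.get(self.replica_id, 0) + 1
        self.state[self.replica_id] = count
        delta = {self.replica_id: count}   # only the entry that changed
        self.pending.append((delta, set(self.peers)))
        return delta

    def merge(self, delta):
        # Per-entry maximum, so a redelivered delta changes nothing.
        for r, c in delta.items():
            self.state[r] = max(self.state.get(r, 0), c)

    def ack(self, delta, peer):
        for d, waiting in self.pending:
            if d == delta:
                waiting.discard(peer)
        # Drop mutations that every peer has confirmed.
        self.pending = [(d, w) for d, w in self.pending if w]

    def value(self):
        return sum(self.state.values())

r1 = DeltaCounter("R1", peers=["R2", "R3"])
r2 = DeltaCounter("R2", peers=["R1", "R3"])

delta = r1.incr()            # ship {"R1": 1} instead of the whole state
r2.merge(delta)
r1.ack(delta, "R2")
print(r2.value())            # 1
print(len(r1.pending))       # 1, still waiting on R3
r1.ack(delta, "R3")
print(len(r1.pending))       # 0
```

Keeping the pending list only in memory is exactly where the durability question arises: if the process restarts, unacknowledged deltas are lost unless the list is persisted.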
Challenge with Delta-state CRDTs
● Durability
● Practical overhead of maintaining list
Sailesh Mukil
smukil@netflix
Thank You.