How Cassandra Deletes Data (Alain Rodriguez, The Last Pickle) | Cassandra Summit 2016
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
HOW CASSANDRA DELETES DATA
Alain Rodriguez
Deleted data in Cassandra does not just disappear;
instead, a tombstone is added.
About deletes in Cassandra
OK, so what’s the matter? Why this talk?
Tombstones are needed in Cassandra; they are not an issue…
…until an SSTable or a query result looks like this…
Then we see it on the user mailing list or in other community tools.
So I thought I could share something about this topic.
thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Tombstone issues: impacts
The read path: reading tombstones induces latencies, timeouts or exceptions.
The disk space: tombstones can fill up the disk (up to 100%).
I am facing one of these issues. Is it caused by tombstones?
Tombstone issues: Read Path
grep -i -e "ERROR" -e "WARN" /var/log/cassandra/system.log
WARN [SharedPool-Worker-7] 2016-07-16 16:31:09,048 SliceQueryFilter.java:319 - Read 276 live and 1104 tombstone cells in mykeyspace.mytable for key: ItV9kZC8mFNiSvYM8AwufBU8tTtJkW5dUH5MNcq1H18 (see tombstone_warn_threshold). 500 columns were requested, slices=[-]
ERROR [ReadStage:290729] 2016-07-16 17:00:18,708 SliceQueryFilter.java (line 206) Scanned over 100000 tombstones in mykeyspace.mytable; query aborted (see tombstone_failure_threshold)
ERROR [ReadStage:290729] 2016-04-22 17:00:18,709 CassandraDaemon.java (line 258) Exception in thread Thread[ReadStage:290729,5,main]
java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
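To go beyond grep and see which tables hit these thresholds most often, you can extract the counts from the messages. Below is a minimal Python sketch. It assumes the exact message wording shown above, which differs across Cassandra versions. `summarize_tombstone_warnings` is a hypothetical helper and is not part of any Cassandra tooling.

```python
import re
from collections import defaultdict

# Assumed WARN wording (as above):
#   "Read 276 live and 1104 tombstone cells in mykeyspace.mytable for key: ..."
WARN_RE = re.compile(r"Read (\d+) live and (\d+) tombstone cells in (\S+) ")
# Assumed ERROR wording (as above):
#   "Scanned over 100000 tombstones in mykeyspace.mytable; query aborted"
ERROR_RE = re.compile(r"Scanned over (\d+) tombstones in ([^;\s]+); query aborted")


def summarize_tombstone_warnings(lines):
    """Return {table: {"warnings": n, "aborted": n, "max_tombstones": n}}."""
    stats = defaultdict(lambda: {"warnings": 0, "aborted": 0, "max_tombstones": 0})
    for line in lines:
        m = WARN_RE.search(line)
        if m:
            tombstones, table = int(m.group(2)), m.group(3)
            s = stats[table]
            s["warnings"] += 1
            s["max_tombstones"] = max(s["max_tombstones"], tombstones)
            continue
        m = ERROR_RE.search(line)
        if m:
            tombstones, table = int(m.group(1)), m.group(2)
            s = stats[table]
            s["aborted"] += 1  # query hit tombstone_failure_threshold
            s["max_tombstones"] = max(s["max_tombstones"], tombstones)
    return dict(stats)
```

For example: `with open("/var/log/cassandra/system.log", errors="replace") as f: print(summarize_tombstone_warnings(f))`.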
Tombstone issues: Read Path
The TombstoneScannedHistogram metric
Available through JMX or a plugged-in monitoring tool such as Datadog, Grafana, SPM, OpsCenter… (commercial and free options).
Tombstone issues: Disk space
The DroppableTombstoneRatio metric provides useful information.
It is available through the sstablemetadata tool, JMX and plugged-in monitoring tools such as Datadog, Grafana, SPM, OpsCenter, etc.
It is possible, for example, to write a script that checks this ratio for the biggest SSTables.
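Such a script could look like the Python sketch below. It assumes `sstablemetadata` is on the PATH and prints a line of the form `Estimated droppable tombstones: <ratio>`. The exact output varies by version. The function names and the `top_n` default are illustrative choices. The 0.2 default threshold mirrors the default `tombstone_threshold`.

```python
import re
import subprocess

# Assumed sstablemetadata output line (wording may vary by version):
#   Estimated droppable tombstones: 0.27
RATIO_RE = re.compile(
    r"Estimated droppable tombstones:\s*(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)")


def droppable_ratio(metadata_text):
    """Extract the estimated droppable tombstone ratio, or None if absent."""
    m = RATIO_RE.search(metadata_text)
    return float(m.group(1)) if m else None


def read_metadata(path):
    """Run sstablemetadata on one *-Data.db file and return its text output."""
    return subprocess.run(["sstablemetadata", path],
                          capture_output=True, text=True, check=True).stdout


def biggest_with_tombstones(sstables, top_n=5, threshold=0.2):
    """sstables: iterable of (path, size_bytes, metadata_text).

    Keep the top_n biggest SSTables and return (path, size, ratio) for
    those whose estimated droppable tombstone ratio exceeds threshold.
    """
    biggest = sorted(sstables, key=lambda s: s[1], reverse=True)[:top_n]
    flagged = []
    for path, size, text in biggest:
        ratio = droppable_ratio(text)
        if ratio is not None and ratio > threshold:
            flagged.append((path, size, ratio))
    return flagged
```

Feed it `(path, os.path.getsize(path), read_metadata(path))` for each `*-Data.db` file of the table you are investigating.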
Why Tombstones: Cassandra write path
[Diagram: on a Cassandra node, a client write goes to the commit log on disk and to the memtable in memory; the memtable is flushed to disk as an immutable SSTable, next to the existing SSTables.]
[Diagram: the same write path; a client read has to look at both the memtable and the immutable SSTables.]
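Why does this design need tombstones? SSTables are immutable, so a delete cannot erase a value that has already been flushed. It can only write a newer marker that shadows the old value, and reads merge all sources. The following is a toy single-node Python model of that idea. It assumes one value per key and last-write-wins on timestamps. `ToyNode` and its methods are hypothetical and much simpler than Cassandra's storage engine.

```python
TOMBSTONE = object()  # marker written by a delete


class ToyNode:
    """Memtable + SSTables; reads merge everything, newest timestamp wins."""

    def __init__(self):
        self.memtable = {}  # key -> (timestamp, value or TOMBSTONE)
        self.sstables = []  # flushed dicts, never modified after flush

    def write(self, key, value, ts):
        self.memtable[key] = (ts, value)

    def delete(self, key, ts):
        # The old value may sit in an SSTable we cannot modify,
        # so a delete is just another write: a tombstone.
        self.memtable[key] = (ts, TOMBSTONE)

    def flush(self):
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        candidates = [t[key] for t in self.sstables + [self.memtable] if key in t]
        if not candidates:
            return None
        ts, value = max(candidates, key=lambda c: c[0])
        return None if value is TOMBSTONE else value
```

After `write("k", "A", ts=1)`, `flush()` and `delete("k", ts=2)`, the first SSTable still contains A, yet `read("k")` returns `None` because the tombstone is newer.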
Why Tombstones: Distributed system
Cassandra is a distributed system
Distributed deletes are tricky!
Why Tombstones: Cassandra consistency
Cassandra cluster: 4 nodes, RF = 3, write CL = Quorum = 2, read CL = Quorum = 2
• Client writes “A”: none of the three replicas holds it yet (?, ?, ?).
• Two replicas store A and acknowledge the write (Ack, Ack); the third replica (?) may not have it yet.
• Client reads “A”: the 2 replicas answering the read always include at least one of the 2 replicas that acknowledged the write.
Strong consistency: read CL + write CL > RF (2 + 2 > 3).
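You can check the overlap argument by brute force. The small Python sketch below assumes replicas are interchangeable and that a write contacts any `w` replicas and a read any `r` replicas out of RF. `quorum` and `always_overlap` are hypothetical helpers.

```python
from itertools import combinations


def quorum(rf):
    """Quorum = floor(RF / 2) + 1 replicas."""
    return rf // 2 + 1


def always_overlap(rf, w, r):
    """True if every set of r replicas shares a node with every set of w replicas."""
    replicas = range(rf)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))
```

`always_overlap(3, 2, 2)` is true and `always_overlap(3, 1, 2)` is false, matching the R + W > RF rule. Since `quorum(3)` is 2, one of the three replicas can also be down without blocking Quorum operations.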
Why Tombstones: Cassandra consistency & availability
• Same scenario, but the third replica (the one without A) is down.
• Client writes “A” and gets two Acks; client reads “A” from the two live replicas.
High availability: with RF = 3, Quorum reads and writes still succeed with one replica down.
Why Tombstones: Distributed deletes
WITHOUT tombstones (same cluster; A is now stored on all three replicas)
• Client deletes “A”.
• Two replicas remove A and acknowledge the delete (Ack, Ack); the third replica still holds A.
• Client reads through a quorum that includes the replica still holding A and gets “A”. Wrong: the other replica simply has nothing, and nothing cannot override an existing value.
• Client reads through the two replicas that removed A and gets “empty”. Correct, but the answer depends on which replicas respond.
Why Tombstones: Distributed deletes
WITH tombstones (same cluster; A is stored on all three replicas)
• Client deletes “A”. A* = tombstone on A.
• Two replicas write the tombstone A* and acknowledge (Ack, Ack); the third replica still holds A.
• Client reads: any quorum includes at least one A*, which is newer than A, so the read returns “A*”, meaning “empty”. Correct.
Without tombstones the read can be wrong; with tombstones it is correct.
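You can reproduce both outcomes by enumerating every quorum the coordinator might contact. The Python sketch below assumes the coordinator keeps the cell with the newest timestamp and that a replica with no data contributes nothing. Read repair and digest reads are ignored. The helper names are hypothetical.

```python
from itertools import combinations

TOMBSTONE = "tombstone"


def reconcile(responses):
    """Coordinator merge: skip replicas with no data, newest timestamp wins."""
    cells = [r for r in responses if r is not None]
    if not cells:
        return "empty"
    ts, value = max(cells, key=lambda c: c[0])
    return "empty" if value == TOMBSTONE else value


def quorum_read_results(replicas, cl=2):
    """All answers a client can get, over every possible quorum of replicas."""
    return {reconcile(q) for q in combinations(replicas, cl)}


# A was written at ts=1 on all three replicas; the delete at ts=2 reached two.
without_tombstones = [None, None, (1, "A")]                   # data just removed
with_tombstones = [(2, TOMBSTONE), (2, TOMBSTONE), (1, "A")]  # A* written
```

`quorum_read_results(without_tombstones)` contains both `"empty"` and `"A"`, while `quorum_read_results(with_tombstones)` contains only `"empty"`.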
When are tombstones removed?
When should tombstones be removed?
• Once the tombstone is fully replicated
• When the deleted data has been removed
When are tombstones actually removed?
• After gc_grace_seconds
• During compactions, IF all the deleted data and the tombstone itself are involved
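Here is a simplified version of this rule as a Python sketch. It models SSTables as plain sets of partition keys. It ignores details of the real implementation, such as timestamp comparisons and bloom filters. `can_purge` is a hypothetical name.

```python
def can_purge(key, deletion_time, now, gc_grace_seconds, others):
    """Simplified purge rule for one tombstone during a compaction.

    deletion_time, now: seconds since epoch.
    others: SSTables (sets of keys) of the same table NOT in this compaction.
    """
    # Condition 1: gc_grace_seconds has elapsed since the delete.
    grace_expired = now > deletion_time + gc_grace_seconds
    # Condition 2: no SSTable left out of the compaction may still hold data
    # for this key; otherwise dropping the tombstone could resurrect it.
    shadowed_data_outside = any(key in sstable for sstable in others)
    return grace_expired and not shadowed_data_outside
```

With the default gc_grace_seconds of 864000 (10 days), a tombstone written 11 days ago can be purged only if no SSTable outside the compaction contains its key.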
How tombstones are removed: Compaction!
[Diagram: the same write path; compacting the 4 SSTables on disk produces a single SSTable.]
Implications in the real world
• No compaction = no eviction
• Plus TTLs or deletes: tombstones stack up (up to 100% of the disk)
• Overlapping SSTables = no eviction
• Fragmented data = eviction unlikely
• LCS: tombstone in a different level than the data = no eviction
• TTL << gc_grace_seconds = high % of useless data
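To put a number on the last point, consider an idealized case: a constant write rate, the same TTL on every write, and expired data purged exactly gc_grace_seconds after it expires. Under those assumptions, each write spends TTL seconds on disk as useful data and gc_grace_seconds as useless data. The Python sketch below computes that ratio. `useless_fraction` is a hypothetical helper.

```python
def useless_fraction(ttl_seconds, gc_grace_seconds):
    """Idealized steady-state share of on-disk data that is already expired.

    Assumes a constant write rate, the same TTL on every write, and that each
    expired cell is purged exactly gc_grace_seconds after it expires. In
    practice purging can only happen later, so this is a lower bound.
    """
    return gc_grace_seconds / (ttl_seconds + gc_grace_seconds)
```

With a 1-day TTL and the default gc_grace_seconds of 864000 (10 days), `useless_fraction(86400, 864000)` is 10/11, so roughly 91% of the data on disk is already expired.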
Some tuning!
Issue: No compaction = No eviction
CASSANDRA-3442: tombstone_threshold (C* 1.2.b1)
Compaction option, default: tombstone_threshold = 0.2 (ratio: 20% of the data has been deleted)
Single-SSTable compaction triggered based on an estimate!
Low risk: worst case, it is a no-op.
Some tuning!
Issue: Tombstone compaction loop!
CASSANDRA-4022: Check for key overlaps (C* 1.2.b1)
Internals improvement, not an option:
The estimated droppable tombstone ratio is improved and now considers key overlap with other SSTables
Some tuning!
Issue: Tombstone compaction loop!
CASSANDRA-4781: tombstone_compaction_interval (C* 1.2.b2)
Compaction option, default: tombstone_compaction_interval = 86400 (in seconds, i.e. 1 day)
Definitely prevents loops
Some tuning!
Issue: Compacting to remove tombstones is expensive
CASSANDRA-5228: Expired SSTables (C* 2.0.b1)
Internals improvement, not an option: fully expired SSTables are dropped as a whole instead of being compacted
Effective with time series, DTCS / TWCS and TTLs!
Some tuning!
Issue: Tombstone compactions not triggering
CASSANDRA-6563: unchecked_tombstone_compaction (C* 2.0.9)
Compaction option, default: unchecked_tombstone_compaction = false
The CASSANDRA-4022 overlap check becomes optional (set to true to skip it)
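Together, these options gate single-SSTable tombstone compactions. The simplified Python sketch below shows how they combine. It reduces the overlap check to a yes/no flag, whereas Cassandra's actual check estimates how many keys overlap. `SSTableInfo` and `worth_tombstone_compaction` are hypothetical names and are not Cassandra's code.

```python
from dataclasses import dataclass


@dataclass
class SSTableInfo:
    age_seconds: float             # time since the SSTable was written
    droppable_ratio: float         # estimated droppable tombstone ratio
    overlaps_other_sstables: bool  # its keys also appear in other SSTables


def worth_tombstone_compaction(sst, tombstone_threshold=0.2,
                               tombstone_compaction_interval=86400,
                               unchecked_tombstone_compaction=False):
    # CASSANDRA-4781: leave young SSTables alone, which breaks compaction loops.
    if sst.age_seconds < tombstone_compaction_interval:
        return False
    # CASSANDRA-3442: only compact if enough of the SSTable looks droppable.
    if sst.droppable_ratio <= tombstone_threshold:
        return False
    # CASSANDRA-6563: optionally skip the overlap check altogether.
    if unchecked_tombstone_compaction:
        return True
    # CASSANDRA-4022 (simplified): with overlaps, tombstones likely can't be purged.
    return not sst.overlaps_other_sstables
```

These are per-table compaction sub-options. For example: `ALTER TABLE mykeyspace.mytable WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'tombstone_threshold': '0.2', 'unchecked_tombstone_compaction': 'true'};` (the values are only an illustration).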
Some tuning!
Issue: Overlapping SSTables preventing efficient tombstone compactions
CASSANDRA-7019: provide_overlapping_tombstones (C* 3.10)
Compaction option, default: provide_overlapping_tombstones = NONE (CELL / ROW / NONE)
Risky:
• Not yet released, so not really tested
• Heavier tombstone compactions
Some tuning - Tombstone distribution!
Tombstones not fully replicated (A*, A*, A): a Quorum read still returns “A*”, meaning “empty”. Correct.
Some tuning - Tombstone distribution!
Case where a node fails + no repair = case without tombstones
• The third replica never receives the tombstone, and no repair fixes it.
• After gc_grace_seconds, compaction removes the tombstones (A* removed) while A survives on the third replica.
• A read that reaches that replica returns “A”. Wrong.
So: node failure + no repair = zombie data!
• Once every tombstone has been removed, nothing marks A as deleted anymore: reads reaching the third replica return “A”, and read repair can copy it back to the other replicas.
Some tuning - Tombstone distribution!
CASSANDRA-6434 (C* 3.0.b1): only_purge_repaired_tombstones (default: false)
• Tombstones that have not been repaired are not purged (A* not removed).
• Client reads “A*”, meaning “empty”. Correct.
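The following is a toy Python model of the zombie scenario and of what only_purge_repaired_tombstones changes. It assumes a single cell and three replicas answering Quorum reads. Each replica already holds only its newest version, and a tombstone is purged as soon as gc_grace_seconds has elapsed. All names are hypothetical.

```python
from itertools import combinations

GC_GRACE = 864000  # default gc_grace_seconds (10 days)


def compact(cell, repaired, now, only_purge_repaired=False):
    """What one replica keeps for the cell after a toy compaction.

    cell: None, ("A", write_time) or ("tombstone", delete_time).
    """
    if cell is None or cell[0] != "tombstone":
        return cell
    expired = now > cell[1] + GC_GRACE
    if expired and (repaired or not only_purge_repaired):
        return None  # tombstone purged
    return cell      # tombstone kept


def quorum_answers(cells):
    """Answers over every pair of replicas; newest timestamp wins."""
    answers = set()
    for pair in combinations(cells, 2):
        present = [c for c in pair if c is not None]
        if not present:
            answers.add("empty")
        else:
            newest = max(present, key=lambda c: c[1])
            answers.add("empty" if newest[0] == "tombstone" else newest[0])
    return answers


def scenario(only_purge_repaired):
    # A written at t=0 everywhere; the delete at t=10 reached two replicas
    # and no repair ran, so those tombstones are unrepaired.
    cells = [("tombstone", 10), ("tombstone", 10), ("A", 0)]
    now = 10 + GC_GRACE + 1
    return quorum_answers([compact(c, repaired=False, now=now,
                                   only_purge_repaired=only_purge_repaired)
                           for c in cells])
```

`scenario(False)` can answer `"A"`, which is the zombie case. `scenario(True)` only answers `"empty"`. With the option enabled and no successful repair, `compact` keeps the tombstone forever, which is the limitation described next.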
Limitation
Repair failing or no repair = permanent tombstones
Things we know about tombstones
• Tombstones are due to deletes and TTLs
• Tombstones fit the Cassandra write path
• Tombstones ensure consistency
• Reading tombstones is expensive and can produce failures
• Tombstones take space on disk and might be tricky to remove
• Tombstones need to be distributed before being removed
Takeaways
• Model data and workflows to avoid reading many tombstones
• Deleted data = repair the table within gc_grace_seconds
• Monitor tombstones, keep control! (Set some alerts?)
• Use compaction options to tackle problems; there is always a way.
• No way? Ask, or create a JIRA ticket and keep improving Cassandra!
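For the second takeaway, a simple alert could compare each table's last successful repair with its gc_grace_seconds. The Python sketch below assumes you can obtain the end time of the last full repair from your own repair tooling, which is not shown. The 0.8 safety margin and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def repair_overdue(last_repair_end, gc_grace_seconds, now, margin=0.8):
    """True if the table was not fully repaired within margin * gc_grace_seconds.

    last_repair_end: datetime of the last successful full repair (None if never).
    margin < 1 leaves time to rerun a failed repair before gc_grace_seconds ends.
    """
    if last_repair_end is None:
        return True
    return now - last_repair_end > timedelta(seconds=gc_grace_seconds * margin)
```

With the default gc_grace_seconds of 864000 (10 days) and a 0.8 margin, the alert fires once the last repair is more than 8 days old.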
Thank you
Questions?
thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html