Inside Cassandra
Michael Penick
Overview
• To disk and back again
• Cassandra Internals by Aaron Morton
• Goals
  – RDBMS comparison to C*
  – Make educated decisions
Distributed Hashing
[Diagram: keys A–P hashed across four nodes, Node 0–Node 3]
Location = Hash(Key) % # Nodes
Distributed Hashing
[Diagram: after adding Node 4, most of the keys A–P hash to a different node]
% Data Moved = 100 * N / (N + 1)
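The formula above can be checked with a quick sketch (illustrative only: integer keys stand in for already-hashed keys, so Hash(Key) is the key itself):

```python
# With Location = Hash(Key) % #Nodes, adding a node reshuffles almost
# everything: a key keeps its location only when key % 4 == key % 5.
def location(key, num_nodes):
    return key % num_nodes  # integer keys stand in for Hash(Key)

keys = range(10000)
moved = sum(1 for k in keys if location(k, 4) != location(k, 5))
print("%.0f%% of keys moved" % (100.0 * moved / len(keys)))  # 80% = 100 * 4 / (4 + 1)
```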
Consistent Hashing
[Diagram: keys A–P placed around a hash ring with Node 1–Node 4; adding Node 0 moves only the keys that fall in the new node's token range]
% Data Moved = 100 * 1 / N
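A minimal consistent-hash ring illustrating why so little data moves (node and key names are made up, and md5 stands in for Cassandra's partitioner):

```python
import bisect
import hashlib

# Each node owns the keys whose tokens fall between its predecessor's
# token and its own, walking clockwise around the ring.
def token(name):
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((token(n), n) for n in nodes)

    def owner(self, key):
        i = bisect.bisect(self.ring, (token(key), "")) % len(self.ring)
        return self.ring[i][1]

keys = ["key%d" % i for i in range(1000)]
before = Ring(["node1", "node2", "node3", "node4"])
after = Ring(["node0", "node1", "node2", "node3", "node4"])
# Only the keys landing in node0's new token range move -- and all of
# them move to node0; every other key keeps its owner.
moved = [k for k in keys if before.owner(k) != after.owner(k)]
print("%d of %d keys moved, all to node0" % (len(moved), len(keys)))
```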
Virtual Nodes
See: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
• num_tokens
• initial_token
Tunable Consistency
replication_factor = 3
[Diagram: six-node ring; the client's write to replicas R1–R3 is acknowledged after one replica responds]
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE
Tunable Consistency
replication_factor = 3
[Diagram: six-node ring; the write is acknowledged after a quorum (2 of 3) of replicas R1–R3 respond]
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM
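For replication_factor = 3, QUORUM works out to two replicas; a small sketch of the arithmetic (RF // 2 + 1 is the standard Cassandra quorum definition):

```python
# QUORUM is a majority of replicas: RF // 2 + 1.
def quorum(rf):
    return rf // 2 + 1

print(quorum(3))  # 2: the coordinator waits for 2 of the 3 replicas

# Reading and writing at QUORUM guarantees overlap: any read quorum
# intersects any write quorum, so a read sees the latest completed write.
rf = 3
print(quorum(rf) + quorum(rf) > rf)  # True
```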
Hinted Handoff
replication_factor = 3 and hinted_handoff_enabled = true
[Diagram: six-node ring; with one of replicas R1–R3 down, the coordinator stores a hint locally and replays it when the replica comes back]
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY
Write locally: system.hints
Note: Hints do not count toward the consistency level (except ANY)
Tunable Consistency
[Diagram: six-node ring with replicas R1–R3; EACH_QUORUM waits for a quorum of replicas in every data center]
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY EACH_QUORUM
[Diagram: cross-data-center write to replicas R1–R2]
Appends FWD_TO parameter to message
Read Repair
replication_factor = 3 and read_repair_chance > 0
SELECT * FROM table USING CONSISTENCY ONE
[Diagram: six-node ring; the client reads from one replica while the other replicas R1–R3 are compared and repaired in the background]
Write
[Diagram: a write of K1 {C1:V1, C2:V2} is appended to the commit log on disk and inserted into the memtable in memory; previously flushed data lives in SSTable #1]
Flush when:
> commitlog_total_space_in_mb, or
> memtable_total_space_in_mb
Write
[Diagram: a later write of K1 {C3:V3} is appended to the commit log and memtable; SSTable #1 already holds C1:V1 and C2:V2, and the next flush produces SSTable #2. The commit log and SSTables can live on separate physical volumes]
Note: All writes are sequential!
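The write path above can be sketched end to end (a toy model, not Cassandra's classes; the memtable limit, names, and flush policy are made up):

```python
# Toy write path: append to a sequential commit log for durability, insert
# into an in-memory map, and flush to an immutable "SSTable" when full.
class ToyNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # sequential, append-only
        self.memtable = {}     # stands in for ConcurrentSkipListMap
        self.sstables = []     # immutable, sorted once flushed
        self.limit = memtable_limit

    def write(self, key, columns):
        self.commit_log.append((key, columns))             # 1. durability
        self.memtable.setdefault(key, {}).update(columns)  # 2. memory
        if len(self.memtable) > self.limit:
            self.flush()

    def flush(self):
        # Flushing is itself one sequential write of sorted data.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # segments covered by the flush can be recycled

node = ToyNode()
node.write("K1", {"C1": "V1", "C2": "V2"})
node.write("K1", {"C3": "V3"})
print(node.memtable["K1"])  # columns for K1 merged in memory, nothing flushed yet
```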
Commit Log
[Diagram: mutations flow through the CommitLogExecutor into segments handed out by the CommitLogAllocator; each segment is written and flushed to a commit log file on disk, sized by commitlog_segment_size_in_mb]
Commit Log
• commitlog_sync
  1. periodic (default)
     • commitlog_sync_period_in_ms (default: 10 seconds)
  2. batch
     • commitlog_batch_window_in_ms
Memtable
• ConcurrentSkipListMap<RowPosition, AtomicSortedColumns> rows;
• AtomicSortedColumns.Holder
  – DeletionInfo deletionInfo; // tombstone
  – SnapTreeMap<ByteBuffer, Column> map;
• Goals
  – Fast operations
  – Fast concurrent access
  – Fast in-order iteration
  – Atomic/Isolated operations within a column family
Skip List
[Diagram: a four-level skip list over keys 1–7; each level's list ends in NIL]
Skip List
Get 7
[Diagram: the search starts at the top level and moves right and down until it reaches 7]
Skip List
Delete 4
[Diagram: 4's predecessors at each level are relinked past it, unlinking 4 from the list]
Skip List
Insert 4
[Diagram: 4 is linked back into the list at each of its levels]
Skip List
ConcurrentSkipListMap uses: p = 0.5
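A single-threaded sketch of the structure with that same p = 0.5 level probability (the lock-free CAS linking the next slides show is omitted; MAX_LEVEL and the API here are illustrative):

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level  # one forward pointer per level

class SkipList:
    MAX_LEVEL = 8

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)

    def _random_level(self):
        # Flip a fair coin (p = 0.5) to decide how tall the node is.
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        cur = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while cur.next[lvl] and cur.next[lvl].key < key:
                cur = cur.next[lvl]
            update[lvl] = cur  # last node before `key` on this level
        node = Node(key, self._random_level())
        for lvl in range(len(node.next)):
            node.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = node

    def contains(self, key):
        cur = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while cur.next[lvl] and cur.next[lvl].key < key:
                cur = cur.next[lvl]
        nxt = cur.next[0]
        return nxt is not None and nxt.key == key

s = SkipList()
for k in [1, 2, 3, 5, 6, 7]:
    s.insert(k)
s.insert(4)
print(s.contains(4), s.contains(9))  # True False
```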
Skip List
[Diagram: lock-free insert of 2 between H → 1 → 3 → T: the new node's next is pointed at 3, then 1's next pointer is swung to 2 with a CAS]
Skip List
while(true):next = current.nextnew_node.next = nextif(CompareAndSwap(current.next, next,
new_node)):break
Skip List
[Diagram: if one thread CAS-deletes a node while another CAS-inserts 2 after it, the inserted node can end up unreachable ("I'm lost!"); deletion therefore also uses CAS, so racing operations conflict on the same next pointer and one of them retries]
Skip List
Insert 4
[Diagram: 4 is inserted level by level, bottom-up; at each level the predecessor's next pointer is swung to the new node with a CAS]
Skip List
Delete 4
[Diagram: 4 is first logically deleted by marking it (NIL) with a CAS, then physically unlinked from each level]
SnapTree
[Tree: root 3, children 2 and 5, leaves 1, 4, 6]
Node  Balance Factor
1     0
2     1
3     0
4     0
5     0
6     0
Balance Factor = Height(Left-Subtree) – Height(Right-Subtree)
SnapTree
[Tree: root 5, children 2 and 6; 2 has children 1 and 3; 3 has right child 4]
Node  Balance Factor
1     0
2     -1
3     -1
4     0
5     2
6     0
Balance Factor must be -1, 0 or +1
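The balance factors above can be recomputed directly from the definition (tuples stand in for tree nodes; this is just a check of the table, not SnapTree's code):

```python
# node = (key, left, right); an empty subtree is None with height 0.
def height(node):
    if node is None:
        return 0
    return 1 + max(height(node[1]), height(node[2]))

def balance(node):
    # Balance Factor = Height(Left-Subtree) - Height(Right-Subtree)
    return height(node[1]) - height(node[2])

# The unbalanced tree from the slide: 5 at the root, 2 and 6 below it,
# then 1 and 3, with 4 hanging off 3.
n4 = (4, None, None)
n3 = (3, None, n4)
n1 = (1, None, None)
n2 = (2, n1, n3)
n6 = (6, None, None)
tree = (5, n2, n6)
print(balance(tree))  # 2 -> violates the -1/0/+1 invariant, so a rotation is needed
```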
SnapTree
[Diagram: a Left-Right case is first rotated into a Left-Left case, then a right rotation produces the balanced tree: 4 at the root, 3 and 5 as children, subtrees A–D reattached]
SnapTree
[Diagram: a Right-Left case is first rotated into a Right-Right case, then a left rotation produces the same balanced tree: 4 at the root, 3 and 5 as children, subtrees A–D reattached]
SnapTree
[Diagram: after inserting 4, node 5 has balance factor 2; rotations restore every balance factor to -1, 0, or +1, leaving 3 at the root with children 2 and 5 and leaves 1, 4, 6]
Epoch
SnapTree
[Diagram: Insert 4 and Get 4 descend the tree optimistically, recording node versions on the way down (Version(5) is 0, Version(2) is 0) and validating them (Does Version(5) == 0?); Insert locks only the node it modifies]
Epoch
SnapTree
[Diagram: if a node's version has changed mid-descent (Does Version(5) == 0? NO!), the operation goes back to that node and retries from there]
Epoch
SnapTree
[Diagram: Delete 3 locks node 3, waiting if another operation already holds the lock, then removes the entry logically with SetValue(3, null)]
SnapTree
[Diagram: Clone waits for Epoch #1's in-flight Delete and Insert operations to stop; afterwards Epoch #2 and Epoch #3 share the same root and tree ("I'm shared!")]
SnapTree
[Diagram: Insert 7 into Epoch #3 lazily copies the shared nodes along its path before linking 7, so Epoch #2's tree is left untouched]
SnapTree
C* 2.0.0 - File: db/AtomicSortedColumns.java Line: 307
SSTable
[Diagram: Data.db stores rows K1–K3 with their columns; Index.db maps each key to its offset in Data.db; Filter.db holds the bloom filter. Without compression, CRC.db stores block checksums; with compression, CompressionInfo.db maps uncompressed offsets to compressed chunks]
• CASSANDRA-2319: Promote row index
• CASSANDRA-4885: Remove … per-row bloom filters
Delete
• Essentially a write (mutation)
• Data is not removed immediately; a tombstone record is added
• tombstone time > gc_grace = data removed (compaction)
Bloom Filter
[Diagram: inserting K1 runs it through several hash functions, each setting one bit in a bit array that starts all zero]
hash = murmur3(key)  # creates two hashes
for i in range(num_hashes):
    result[i] = abs(hash[0] + i * hash[1]) % num_bits
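A runnable version of that double-hashing scheme (md5 stands in for murmur3 here, and the bit and hash counts are arbitrary):

```python
import hashlib

# Two base hashes h1, h2 are combined as h1 + i*h2 to simulate
# num_hashes independent hash functions over one bit array.
class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.bits = [0] * num_bits
        self.num_hashes = num_hashes

    def _positions(self, key):
        digest = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little")
        return [(h1 + i * h2) % len(self.bits) for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("K1")
print(bf.might_contain("K1"))  # True (a bloom filter has no false negatives)
```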
Bloom Filter
Bloom filter probability calculation: bloom_filter_fp_chance (config) and the SSTable's number of rows determine the number of hashes and the number of bits per entry.
Read
[Diagram: a read of K1 merges the memtable columns (C4:V4, C5:V5) with SSTable #2 (C3:V3) and SSTable #1 (C1:V1, C2:V2) into the full row K1 {C1:V1, C2:V2, C3:V3, C4:V4, C5:V5}]
Row Cache (off-heap): row_cache_size_in_mb > 0
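The read-path merge of K1's columns, in miniature (plain dicts stand in for the memtable and SSTables; Cassandra actually resolves conflicts by column timestamp, which "newest source last" approximates here):

```python
# Reading K1 assembles its row by merging columns from the oldest
# source to the newest; later sources overwrite earlier ones.
sstable1 = {"K1": {"C1": "V1", "C2": "V2"}}
sstable2 = {"K1": {"C3": "V3"}}
memtable = {"K1": {"C4": "V4", "C5": "V5"}}

row = {}
for source in (sstable1, sstable2, memtable):  # oldest first
    row.update(source.get("K1", {}))
print(sorted(row))  # ['C1', 'C2', 'C3', 'C4', 'C5']
```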
Read
[Diagram: a read checks the bloom filter, then the key cache; on a cache hit it goes straight to the data, on a miss it walks the partition summary, partition index, and compression offsets to locate the data]
Key Cache (off-heap): key_cache_size_in_mb > 0
index_interval = 128 (default)
Compaction (Size-tiered)
[Diagram: each memtable flush adds an SSTable; once min_compaction_threshold = 4 similarly sized SSTables accumulate, they are merged into one larger SSTable]
Compaction (Leveled)
[Diagram: memtable flushes land in L0; SSTables are then promoted into fixed-size levels]
sstable_size_in_mb = 160
L0: 160 MB, L1: 160 MB x 10, L2: 160 MB x 100, …
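The level sizes above follow a simple geometric rule; a sketch of the arithmetic (treating L0 as one SSTable's worth of data, which is a simplification):

```python
# With sstable_size_in_mb = 160, each level holds 10x the previous one.
sstable_size_mb = 160

def level_capacity_mb(level):
    return sstable_size_mb * 10 ** level

for lvl in range(3):
    print("L%d: %d MB" % (lvl, level_capacity_mb(lvl)))  # 160, 1600, 16000
```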
Topics
• CAS (PAXOS)
• Anti-entropy (Merkle trees)
• Gossip (Failure detection)
Thanks