A Deep Dive Into Understanding Apache Cassandra

Inside Cassandra Michael Penick

Upload: planet-cassandra

Post on 17-Jun-2015


DESCRIPTION

Inside Cassandra – C* is an interesting piece of software for many reasons, but it is especially interesting in its use of elegant data structures and algorithms. This talk will focus on the data structures and algorithms that make C* such a scalable and performant database. We will walk along the write, read and delete paths, exploring the low-level details of how each of these operations works. We will also explore some of the background processes that maintain availability and performance. The goal of this talk is to gain a deeper understanding of C* by exploring the low-level details of its implementation.

TRANSCRIPT

Page 1: A Deep Dive Into Understanding Apache Cassandra

Inside Cassandra

Michael Penick

Page 2: A Deep Dive Into Understanding Apache Cassandra

Overview

• To disk and back again
• Cassandra Internals by Aaron Morton
• Goals
– RDBMS comparison to C*
– Make educated decisions

I’m configuration

Page 3: A Deep Dive Into Understanding Apache Cassandra

Distributed Hashing

[Figure: keys A through P spread across Node 0, Node 1, Node 2 and Node 3]

Location = Hash(Key) % # Nodes

Page 4: A Deep Dive Into Understanding Apache Cassandra

Distributed Hashing

[Figure: keys A through P redistributed across Node 0 through Node 4 after Node 4 is added]

% Data Moved = 100 * N / (N + 1)
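The N / (N + 1) estimate can be checked empirically. The sketch below is only an illustration: it places keys with `Hash(Key) % # Nodes`, grows a hypothetical cluster from 4 to 5 nodes, and counts how many keys change location. The MD5-based `stable_hash` helper and the `key-…` names are assumptions made for the example, not anything Cassandra uses.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def location(key: str, num_nodes: int) -> int:
    # Location = Hash(Key) % # Nodes
    return stable_hash(key) % num_nodes


def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys whose node changes when the node count changes."""
    moved = sum(1 for k in keys if location(k, old_nodes) != location(k, new_nodes))
    return moved / len(keys)


keys = [f"key-{i}" for i in range(10_000)]  # hypothetical keys
moved = fraction_moved(keys, old_nodes=4, new_nodes=5)
print(f"moved: {moved:.1%}, predicted: {4 / (4 + 1):.1%}")
```

With modulo placement almost every key has to move, which is the motivation for the consistent hashing slides that follow.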

Page 5: A Deep Dive Into Understanding Apache Cassandra

Consistent Hashing

[Figure: token ring starting at 0 with Node 1, Node 2, Node 3 and Node 4]

Page 6: A Deep Dive Into Understanding Apache Cassandra

Consistent Hashing

[Figure: token ring starting at 0 with keys A through P placed around it]

Add Node

[Figure: the same ring and keys A through P after a node is added]

% Data Moved = 100 * 1 / N
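A minimal sketch of a token ring, assuming MD5-derived tokens and 64 tokens per node (both choices are illustrative and not taken from the talk or from Cassandra's partitioners). A key belongs to the first token at or after its hash, wrapping around at the end of the ring, so adding a node only takes keys away from existing owners and never shuffles keys between them.

```python
import bisect
import hashlib


def token(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, tokens_per_node: int = 64):  # assumed value
        self.tokens_per_node = tokens_per_node
        self.tokens = []   # sorted token positions on the ring
        self.owners = {}   # token -> node name

    def add_node(self, node: str) -> None:
        for i in range(self.tokens_per_node):
            t = token(f"{node}#{i}")
            bisect.insort(self.tokens, t)
            self.owners[t] = node

    def owner(self, key: str) -> str:
        # First token >= hash(key); wrap to the start of the ring if none.
        i = bisect.bisect_left(self.tokens, token(key)) % len(self.tokens)
        return self.owners[self.tokens[i]]


ring = Ring()
for name in ("node1", "node2", "node3", "node4"):
    ring.add_node(name)

keys = [f"key-{i}" for i in range(10_000)]  # hypothetical keys
before = {k: ring.owner(k) for k in keys}
ring.add_node("node5")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"moved: {len(moved) / len(keys):.1%}")
```

Every key in `moved` ends up on the new node, and the moved share is roughly the new node's share of the ring rather than most of the data.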

Page 8: A Deep Dive Into Understanding Apache Cassandra

Tunable Consistency

[Figure: token ring starting at 0 with Node 1 through Node 6; the client and replicas R1, R2 and R3]

replication_factor = 3

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE

Page 9: A Deep Dive Into Understanding Apache Cassandra

Tunable Consistency

[Figure: token ring starting at 0 with Node 1 through Node 6; the client and replicas R1, R2 and R3]

replication_factor = 3

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM
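As a rough illustration of what the consistency level means for the coordinator, the sketch below computes how many replica acknowledgements are needed for ONE, QUORUM and ALL, using the usual majority rule of floor(RF / 2) + 1 for QUORUM. The function names are hypothetical, and the other levels shown on these slides (ANY, EACH_QUORUM) are deliberately left out.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed before the request succeeds."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {level}")


def read_sees_latest_write(write_level: str, read_level: str, rf: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    return required_acks(write_level, rf) + required_acks(read_level, rf) > rf


for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, replication_factor=3))
```

With replication_factor = 3, QUORUM writes plus QUORUM reads always overlap on at least one replica, while ONE plus ONE does not.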

Page 10: A Deep Dive Into Understanding Apache Cassandra

Hinted Handoff

[Figure: token ring starting at 0 with Node 1 through Node 6; the client and replicas R1, R2 and R3]

replication_factor = 3 and hinted_handoff_enabled = true

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY

Write locally: system.hints

Note: Hints do not count toward the consistency level (except ANY)

Page 11: A Deep Dive Into Understanding Apache Cassandra

Tunable Consistency

[Figure: two token rings, each starting at 0 with Node 1 through Node 6; the client and replicas R1, R2 and R3 on the first ring, replicas R1 and R2 on the second]

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY EACH_QUORUM

Appends FWD_TO parameter to message

Page 12: A Deep Dive Into Understanding Apache Cassandra

Read Repair

[Figure: token ring starting at 0 with Node 1 through Node 6; the client and replicas R1, R2 and R3]

SELECT * FROM table USING CONSISTENCY ONE

replication_factor = 3 and read_repair_chance > 0
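The idea behind read repair can be sketched as "newest timestamp wins, then write it back to stale replicas". The code below is a simplified, assumption-laden model: replicas are plain dictionaries, the `Versioned` type is hypothetical, and every read contacts all replicas, whereas the read_repair_chance setting on the slide makes the check probabilistic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # write timestamp; the newest one wins


def read_with_repair(replicas: dict, key: str):
    """Return the newest value for key and overwrite stale replicas with it."""
    responses = {name: store.get(key) for name, store in replicas.items()}
    newest = max(
        (v for v in responses.values() if v is not None),
        key=lambda v: v.timestamp,
        default=None,
    )
    if newest is None:
        return None
    for name, seen in responses.items():
        if seen != newest:
            replicas[name][key] = newest  # repair write to the stale replica
    return newest.value


replicas = {
    "R1": {"K1": Versioned("new", 2)},
    "R2": {"K1": Versioned("old", 1)},
    "R3": {},
}
result = read_with_repair(replicas, "K1")
print(result, replicas)
```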

Page 13: A Deep Dive Into Understanding Apache Cassandra

Write

Memory

Disk

Commit Log

Memtable

K1 C1:V1 C2:V2

K1 C1:V1 C2:V2

SSTable #1

K1 C1:V1 C2:V2

… …

Flush when: > commitlog_total_space_in_mb or > memtable_total_space_in_mb

Page 14: A Deep Dive Into Understanding Apache Cassandra

Write

Memory

Disk

Commit Log

Memtable

K1 C3:V3

K1 C3:V3

SSTable #1 SSTable #2

K1 C1:V1 C2:V2

… … …

Note: All writes are sequential!

Physical Volume #1 Physical Volume #2

K1 C3:V3
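The two write slides can be summarized in a small model: append to the commit log, update the memtable, and flush the memtable to a new immutable SSTable when a threshold is crossed. This is a sketch under stated assumptions: the `Node` class is hypothetical, the flush threshold counts rows instead of the space-based settings named on the slide, and clearing the log on flush stands in for segment recycling.

```python
class Node:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # key -> {column: value}, held in memory
        self.sstables = []     # immutable snapshots, each sorted by key
        self.memtable_limit = memtable_limit  # hypothetical row-count threshold

    def write(self, key: str, columns: dict) -> None:
        self.commit_log.append((key, dict(columns)))        # 1. sequential append
        self.memtable.setdefault(key, {}).update(columns)   # 2. in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # 3. write the memtable out as a new sorted SSTable; never edit old ones
        snapshot = tuple(sorted((k, dict(cols)) for k, cols in self.memtable.items()))
        self.sstables.append(snapshot)
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs to be replayed


node = Node()
node.write("K1", {"C1": "V1", "C2": "V2"})
node.write("K2", {"C1": "V1"})   # reaches the limit and triggers a flush
node.write("K1", {"C3": "V3"})   # lands in a fresh memtable, as on the slide
print(node.sstables, node.memtable)
```

Note that K1 now has columns in both SSTable #1 and the memtable; nothing on disk was rewritten, which is why reads later have to merge several sources.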

Page 15: A Deep Dive Into Understanding Apache Cassandra

Commit Log

[Figure: Mutation #3, Mutation #2 and Mutation #1 queued at the Commit Log Executor; the Commit Log Allocator with Segment #3, Segment #2 and Segment #1 in memory; Commit Log Files on disk]

Flush! Write!

commitlog_segment_size_in_mb

Page 16: A Deep Dive Into Understanding Apache Cassandra

Commit Log

• commitlog_sync
1. periodic (default)
• commitlog_sync_period_in_ms (default: 10 seconds)
2. batch
• commitlog_batch_window_in_ms

Page 17: A Deep Dive Into Understanding Apache Cassandra

Memtable

• ConcurrentSkipListMap<RowPosition, AtomicSortedColumns> rows;

• AtomicSortedColumns.Holder
– DeletionInfo deletionInfo; // tombstone
– SnapTreeMap<ByteBuffer, Column> map;

• Goals
– Fast operations
– Fast concurrent access
– Fast in-order iteration
– Atomic/Isolated operations within a column family

Page 18: A Deep Dive Into Understanding Apache Cassandra

Skip List

1 2 3 4 5 6 7

NIL

NIL

NIL

NIL

Page 19: A Deep Dive Into Understanding Apache Cassandra

Skip List

Get 7

1 2 3 4 5 6 7

NIL

NIL

NIL

NIL

Page 20: A Deep Dive Into Understanding Apache Cassandra

Skip List

Delete 4

1 2 3 4 5 6 7

NIL

NIL

NIL

NIL

Page 21: A Deep Dive Into Understanding Apache Cassandra

Skip List

Delete 4

1 2 3 4 5 6 7

NIL

NIL

NIL

NIL

Page 22: A Deep Dive Into Understanding Apache Cassandra

Skip List

Delete 4

1 2 3 5 6 7

NIL

NIL

NIL

NIL

Page 23: A Deep Dive Into Understanding Apache Cassandra

Skip List

Insert 4

1 2 3 5 6 7

NIL

NIL

NIL

NIL

Page 24: A Deep Dive Into Understanding Apache Cassandra

Skip List

Insert 4

1 2 3 4 5 6 7

NIL

NIL

NIL

NIL

Page 25: A Deep Dive Into Understanding Apache Cassandra

Skip List

ConcurrentSkipListMap uses: p = 0.5
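To make the get, insert and delete slides concrete, here is a minimal single-threaded skip list in which each new node is promoted to the next level with probability p = 0.5. It is a sketch only: the class names and the `MAX_LEVEL` cap are assumptions, and it has none of the lock-free machinery of `ConcurrentSkipListMap`.

```python
import random


class _Node:
    __slots__ = ("key", "forward")

    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # forward[i] is the next node on level i


class SkipList:
    MAX_LEVEL = 16  # assumed cap for this sketch
    P = 0.5         # promotion probability

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._head = _Node(None, self.MAX_LEVEL)

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < self.P:
            level += 1
        return level

    def _predecessors(self, key):
        """Last node with a smaller key on every level, searched top down."""
        preds = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            preds[i] = node
        return preds

    def __contains__(self, key):
        nxt = self._predecessors(key)[0].forward[0]
        return nxt is not None and nxt.key == key

    def insert(self, key):
        preds = self._predecessors(key)
        nxt = preds[0].forward[0]
        if nxt is not None and nxt.key == key:
            return False
        node = _Node(key, self._random_level())
        for i in range(len(node.forward)):   # splice in on each of its levels
            node.forward[i] = preds[i].forward[i]
            preds[i].forward[i] = node
        return True

    def delete(self, key):
        preds = self._predecessors(key)
        target = preds[0].forward[0]
        if target is None or target.key != key:
            return False
        for i in range(len(target.forward)):  # unlink on each of its levels
            preds[i].forward[i] = target.forward[i]
        return True

    def __iter__(self):
        node = self._head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]


sl = SkipList(seed=1)
for k in (5, 1, 7, 3, 2, 6, 4):
    sl.insert(k)
sl.delete(4)
print(list(sl), 7 in sl)
```

The bottom level is always a complete sorted list, which is what gives the memtable its fast in-order iteration.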

Page 26: A Deep Dive Into Understanding Apache Cassandra

Skip List

Insert 4

1 2 3 4 5 6 7

NIL

NIL

NIL

NIL

Page 27: A Deep Dive Into Understanding Apache Cassandra

Skip List

H 1 3 T

2

H 1 3 T

2

CAS

Page 28: A Deep Dive Into Understanding Apache Cassandra

Skip List

while True:
    next = current.next
    new_node.next = next
    if CompareAndSwap(current.next, next, new_node):
        break
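A runnable version of the retry loop, with one loud assumption: Python exposes no hardware compare-and-swap, so the hypothetical `AtomicReference` class below emulates a single atomic step with a lock. The loop itself has the same shape as on the slide: read `next`, point the new node at it, and retry if another thread changed `current.next` in the meantime.

```python
import threading


class AtomicReference:
    """Emulated CAS cell (a lock stands in for the hardware instruction)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def set(self, value):
        self._value = value

    def compare_and_swap(self, expected, new) -> bool:
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = AtomicReference(next_node)


def insert_after(current: Node, new_node: Node) -> None:
    while True:
        nxt = current.next.get()
        new_node.next.set(nxt)  # not yet visible to other threads
        if current.next.compare_and_swap(nxt, new_node):
            break               # published; otherwise somebody raced us, retry


def keys(head: Node) -> list:
    out, node = [], head
    while node is not None:
        out.append(node.key)
        node = node.next.get()
    return out


# The slide's example: insert 2 between 1 and 3.
one = Node(1, Node(3, Node("T")))
head = Node("H", one)
insert_after(one, Node(2))
print(keys(head))

# Many threads inserting after the same node: no insert is lost.
workers = [threading.Thread(target=insert_after, args=(head, Node(f"x{i}")))
           for i in range(50)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(keys(head)))
```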

Page 29: A Deep Dive Into Understanding Apache Cassandra

Skip List

H 1 3 T

H 1 3 T

2

CAS

I’m lost!

Page 30: A Deep Dive Into Understanding Apache Cassandra

Skip List

H 1 3 T

CAS

H 1 3 T

H 1 3 T

CAS

Page 31: A Deep Dive Into Understanding Apache Cassandra

Skip List

Insert 4

1 2 3 5 6 7 8

NIL

NIL

NIL

NIL

4

Page 32: A Deep Dive Into Understanding Apache Cassandra

Skip List

Insert 4

1 2 3 5 6 7 8

NIL

NIL

NIL

NIL

4

CAS

Page 33: A Deep Dive Into Understanding Apache Cassandra

Skip List

Insert 4

1 2 3 5 6 7 8

NIL

NIL

NIL

NIL

4

Page 34: A Deep Dive Into Understanding Apache Cassandra

Skip List

Insert 4

1 2 3 5 6 7 8

NIL

NIL

NIL

NIL

4

Page 35: A Deep Dive Into Understanding Apache Cassandra

Skip List

Insert 4

1 2 3 5 6 7 8

NIL

NIL

NIL

NIL

4

Page 36: A Deep Dive Into Understanding Apache Cassandra

Skip List

Delete 4

1 2 3 4 5 6 7

NIL

NIL

NIL

NIL

Page 37: A Deep Dive Into Understanding Apache Cassandra

Skip List

Delete 4

1 2 3 NIL 5 6 7

NIL

NIL

NIL

NIL

Page 38: A Deep Dive Into Understanding Apache Cassandra

Skip List

Delete 4

1 2 3 NIL 5 6 7

NIL

NIL

NIL

NIL

CAS

Page 39: A Deep Dive Into Understanding Apache Cassandra

Skip List

Delete 4

1 2 3 NIL 5 6 7

NIL

NIL

NIL

NIL

Page 40: A Deep Dive Into Understanding Apache Cassandra

Skip List

Delete 4

1 2 3 NIL 5 6 7

NIL

NIL

NIL

NIL

Page 41: A Deep Dive Into Understanding Apache Cassandra

Skip List

Delete 4

1 2 3 5 6 7

NIL

NIL

NIL

NIL

Page 42: A Deep Dive Into Understanding Apache Cassandra

SnapTree

3

2 5

1 4 6

Node Balance Factor

1 0

2 1

3 0

4 0

5 0

6 0

Balance Factor = Height(Left-Subtree) – Height(Right-Subtree)

Page 43: A Deep Dive Into Understanding Apache Cassandra

SnapTree

5

2 6

1 3

4

Node Balance Factor

1 0

2 -1

3 -1

4 0

5 2

6 0

Balance Factor must be -1, 0 or +1
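The balance factor tables on the last two slides can be reproduced with a few lines of code. This is a plain AVL-style sketch with a hypothetical `Node` class, not SnapTree's implementation; the two trees are the ones drawn on the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def height(node: Optional[Node]) -> int:
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def balance_factor(node: Node) -> int:
    # Balance Factor = Height(Left-Subtree) - Height(Right-Subtree)
    return height(node.left) - height(node.right)


def balance_factors(root: Optional[Node]) -> dict:
    """Map every key to its balance factor (in-order walk)."""
    if root is None:
        return {}
    return {**balance_factors(root.left),
            root.key: balance_factor(root),
            **balance_factors(root.right)}


def is_balanced(root: Optional[Node]) -> bool:
    return all(abs(b) <= 1 for b in balance_factors(root).values())


balanced = Node(3, Node(2, Node(1)), Node(5, Node(4), Node(6)))
unbalanced = Node(5, Node(2, Node(1), Node(3, None, Node(4))), Node(6))
print(balance_factors(balanced), is_balanced(balanced))
print(balance_factors(unbalanced), is_balanced(unbalanced))
```

Node 5 in the second tree has a balance factor of 2, which is what triggers the rotations on the following slides.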

Page 44: A Deep Dive Into Understanding Apache Cassandra

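SnapTree

[Figure: Left-Right Case: 5 with left child 3, whose right child is 4 (subtrees A, B, C, D). A rotation turns it into the Left-Left Case: 5 with left child 4, whose left child is 3. A second rotation yields the balanced tree: root 4 with children 3 and 5 and subtrees A, B, C, D.]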

Page 45: A Deep Dive Into Understanding Apache Cassandra

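SnapTree

[Figure: Right-Left Case: 3 with right child 5, whose left child is 4 (subtrees A, B, C, D). A rotation turns it into the Right-Right Case: 3 with right child 4, whose right child is 5. A second rotation yields the balanced tree: root 4 with children 3 and 5 and subtrees A, B, C, D.]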

Page 46: A Deep Dive Into Understanding Apache Cassandra

SnapTree

5

2 6

1 3

4

Node Balance Factor

1 0

2 1

3 1

4 0

5 2

6 0

5

2

6

1

3

4

Node Balance Factor

1 0

2 -1

3 -1

4 0

5 2

6 0

Page 47: A Deep Dive Into Understanding Apache Cassandra

SnapTree

Node Balance Factor

1 0

2 1

3 1

4 0

5 2

6 0

5

2

6

1

3

4

Node Balance Factor

1 0

2 1

3 0

4 0

5 0

6 0

3

2 5

1 4 6
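The rebalancing shown across these slides is a left-right case: a left rotation at node 2 followed by a right rotation at node 5. The sketch below replays it with a hypothetical `Node` class and textbook AVL rotations; it says nothing about how SnapTree schedules or locks these steps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def rotate_left(x: Node) -> Node:
    """x's right child becomes the subtree root; x becomes its left child."""
    y = x.right
    x.right, y.left = y.left, x
    return y


def rotate_right(x: Node) -> Node:
    """x's left child becomes the subtree root; x becomes its right child."""
    y = x.left
    x.left, y.right = y.right, x
    return y


def fix_left_right(root: Node) -> Node:
    root.left = rotate_left(root.left)  # left-right case -> left-left case
    return rotate_right(root)           # left-left case  -> balanced


def shape(node: Optional[Node]):
    """Nested (key, left, right) tuples, convenient for comparing trees."""
    return None if node is None else (node.key, shape(node.left), shape(node.right))


root = Node(5, Node(2, Node(1), Node(3, None, Node(4))), Node(6))
root = fix_left_right(root)
print(shape(root))
```

The result is the tree with root 3, children 2 and 5, and leaves 1, 4 and 6, matching the final diagram.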

Page 48: A Deep Dive Into Understanding Apache Cassandra

Epoch

SnapTree

5

2 6

1 3

4

Root

Lock

4

Version(5) is 0

Version(2) is 0

Does Version(5) == 0?

Insert

Page 49: A Deep Dive Into Understanding Apache Cassandra

Epoch

SnapTree

5

2 6

1 3

4

Root

Get 4

Version(5) is 0

Version(2) is 0

Does Version(5) == 0?

Page 50: A Deep Dive Into Understanding Apache Cassandra

Epoch

SnapTree

Root

5

2

6

1

3

4

Get 4

Does Version(5) == 0?

NO! Go back to 5

Page 51: A Deep Dive Into Understanding Apache Cassandra

Epoch

SnapTree

Root4

3

2 5

1 4 6

Delete 3

Lock : (

Page 52: A Deep Dive Into Understanding Apache Cassandra

Epoch

SnapTree

Root4

3

2 5

1 4 6

Delete 3

Lock

Page 53: A Deep Dive Into Understanding Apache Cassandra

Epoch

SnapTree

Root4

3

2 5

1 4 6

Delete 3

Lock

SetValue(3, null)

Page 54: A Deep Dive Into Understanding Apache Cassandra

SnapTree

Epoch #1

Root

3

2 5

1 4 6

Clone StopDelete

Insert

Page 55: A Deep Dive Into Understanding Apache Cassandra

SnapTree

Epoch #2

Root

3

2 5

1 4 6

Clone

Epoch #3

Root

I’m shared!

Page 56: A Deep Dive Into Understanding Apache Cassandra

SnapTree

Epoch #2

Root

3

2 5

1 4 6

Epoch #3

Root

Insert 7

Page 57: A Deep Dive Into Understanding Apache Cassandra

SnapTree

Epoch #2

Root

3

2 5

1 4 6

Epoch #3

Root

Insert 7

3

2 5

1 4 6

Page 58: A Deep Dive Into Understanding Apache Cassandra

SnapTree

Epoch #2

Root

3

2 5

1 4 6

Epoch #3

Root

Insert 7

3

2 5

1 4 6

Page 59: A Deep Dive Into Understanding Apache Cassandra

SnapTree

Epoch #2

Root

3

2 5

1 4 6

Epoch #3

Root

Insert 7

3

2 5

1 4 6

7
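The clone-then-insert sequence can be approximated with path copying: an insert into the clone copies only the nodes on the path from the root to the new key and shares every other subtree with the older epoch. This is a simplified stand-in, assuming immutable nodes and no rebalancing; SnapTree's actual copy-on-write and epoch bookkeeping are more involved.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):  # immutable, so sharing between epochs is safe
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node: Optional[Node], key: int) -> Node:
    """Return a new root; only nodes on the search path are copied."""
    if node is None:
        return Node(key)
    if key < node.key:
        return node._replace(left=insert(node.left, key))
    if key > node.key:
        return node._replace(right=insert(node.right, key))
    return node  # key already present, nothing to copy


def keys(node: Optional[Node]) -> list:
    return [] if node is None else keys(node.left) + [node.key] + keys(node.right)


epoch2 = Node(3, Node(2, Node(1)), Node(5, Node(4), Node(6)))
epoch3 = insert(epoch2, 7)   # the "Insert 7" from the slides

print(keys(epoch2))          # the older epoch is unchanged
print(keys(epoch3))
print(epoch3.left is epoch2.left)  # the subtree under 2 is shared, not copied
```

Here nodes 3, 5 and 6 are copied for the new epoch, while the subtree rooted at 2 and the leaf 4 are shared by both epochs.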

Page 60: A Deep Dive Into Understanding Apache Cassandra

Snap Tree

C* 2.0.0 - File: db/AtomicSortedColumns.java Line: 307

Page 61: A Deep Dive Into Understanding Apache Cassandra

SSTable

[Figure: SSTable component files: Filter.db; Data.db (rows K1, K2, K3 with columns C1, C2, C3); CRC.db (checksums 0xFFCC23ED, 0x1FEA2321, 0xCE652133); Index.db (K1, K2, K3 mapped to 00001, 00002, 00003); CompressionInfo.db (00001, 00002, 00003 mapped to 00001, 00004, 00006); Compression? No / Yes]

• CASSANDRA-2319: Promote row index
• CASSANDRA-4885: Remove … per-row bloom filters

Page 62: A Deep Dive Into Understanding Apache Cassandra

Delete

• Essentially a write (mutation)
• Data is not removed immediately; a tombstone record is added instead
• Once a tombstone is older than gc_grace, the data is removed (during compaction)
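A toy model of the delete path, assuming a table is just a dictionary of key to (value, timestamp) and that timestamps are plain seconds. The `TOMBSTONE` marker and the function names are hypothetical; the point is only that a delete is a write, and that the tombstone survives until compaction finds it older than gc_grace.

```python
TOMBSTONE = object()  # marker value meaning "deleted"


def delete(table: dict, key: str, now: int) -> None:
    table[key] = (TOMBSTONE, now)  # a delete is just another write


def read(table: dict, key: str):
    entry = table.get(key)
    if entry is None or entry[0] is TOMBSTONE:
        return None
    return entry[0]


def compact(table: dict, now: int, gc_grace_seconds: int) -> dict:
    """Drop tombstones older than gc_grace; keep everything else."""
    return {
        key: (value, ts)
        for key, (value, ts) in table.items()
        if not (value is TOMBSTONE and now - ts > gc_grace_seconds)
    }


table = {"K1": ("V1", 0), "K2": ("V2", 0)}
delete(table, "K1", now=100)
print(read(table, "K1"), "K1" in table)                       # hidden but still stored
print(sorted(compact(table, now=150, gc_grace_seconds=100)))  # tombstone kept
print(sorted(compact(table, now=300, gc_grace_seconds=100)))  # tombstone purged
```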

Page 63: A Deep Dive Into Understanding Apache Cassandra

Bloom Filter

Page 64: A Deep Dive Into Understanding Apache Cassandra

Bloom Filter

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

K1 Hash Insert

Page 65: A Deep Dive Into Understanding Apache Cassandra

Bloom Filter

0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

K1 Hash Insert

Page 66: A Deep Dive Into Understanding Apache Cassandra

Bloom Filter

1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0

K1 Hash Hash Hash Insert

hash = murmur3(key)  # creates two hashes
for i in range(hash_count):  # one bit position per hash function
    result[i] = abs(hash[0] + i * hash[1]) % num_bits  # index into the bit array
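A small runnable Bloom filter using the same double-hashing trick, where bit position i is derived from two base hashes as h1 + i * h2. One substitution to note: the slide uses murmur3, which is not in Python's standard library, so this sketch splits an MD5 digest into two 64-bit halves instead. The class name and default sizes are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indexes(self, key: str) -> list:
        digest = hashlib.md5(key.encode()).digest()   # stand-in for murmur3
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        # Both halves are unsigned, so no abs() is needed here.
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for i in self._indexes(key):
            self.bits[i] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[i] for i in self._indexes(key))


bf = BloomFilter(num_bits=1024, num_hashes=3)
for n in range(100):
    bf.add(f"K{n}")
print(bf.might_contain("K1"), sum(bf.bits))
```

A filter never returns a false negative, which is why a negative answer lets a read skip an SSTable entirely.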

Page 67: A Deep Dive Into Understanding Apache Cassandra

Bloom Filter

Bloom Filter Probability Calculation

Inputs: bloom_filter_fp_chance (config) and the number of rows (SSTable)

Outputs: number of hashes and number of bits per entry
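The relationship between these quantities can be illustrated with the textbook Bloom filter formulas: bits per entry = -ln(p) / (ln 2)^2 and number of hashes = bits per entry * ln 2. This is an assumption about the general technique; Cassandra's own calculation may round or cap the values differently, and the function name is hypothetical.

```python
import math


def bloom_parameters(fp_chance: float, num_rows: int):
    """Textbook sizing for a target false-positive chance and row count."""
    bits_per_entry = -math.log(fp_chance) / (math.log(2) ** 2)
    num_hashes = max(1, round(bits_per_entry * math.log(2)))
    total_bits = math.ceil(bits_per_entry * num_rows)
    return bits_per_entry, num_hashes, total_bits


bits_per_entry, num_hashes, total_bits = bloom_parameters(0.01, 1_000_000)
print(f"{bits_per_entry:.2f} bits/entry, {num_hashes} hashes, "
      f"{total_bits / 8 / 1024 / 1024:.2f} MiB")
```

Lowering the false-positive chance costs memory: each factor of ten in fp_chance adds roughly 4.8 bits per row under these formulas.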

Page 68: A Deep Dive Into Understanding Apache Cassandra

Read

Memory

Disk

Memtable

K1 C4:V4

SSTable #2

K1 C3:V3

SSTable #1

K1 C1:V1 C2:V2

… …

Memtable

K1 C5:V5

… K1 C4:V4 C1:V1 C2:V2 C3:V3 C5:V5

Row Cache

= Off-heap

row_cache_size_in_mb > 0
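Because a row's columns can be spread over the memtable and several SSTables, a read has to merge them. The sketch below assumes each source maps a key to `{column: (value, timestamp)}` and that the newest timestamp wins per column; the timestamps and the `merge_row` helper are invented for the example, while the column layout follows the slide.

```python
def merge_row(key: str, sources) -> dict:
    """Merge one row across sources; the newest timestamp wins per column."""
    merged = {}
    for source in sources:
        for column, (value, ts) in source.get(key, {}).items():
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    return {column: value for column, (value, _) in sorted(merged.items())}


sstable_1 = {"K1": {"C1": ("V1", 1), "C2": ("V2", 1)}}
sstable_2 = {"K1": {"C3": ("V3", 2)}}
memtable = {"K1": {"C4": ("V4", 3), "C5": ("V5", 4)}}

row = merge_row("K1", [memtable, sstable_2, sstable_1])
print(row)
```

The merged row is the kind of result a row cache would hold so that later reads of K1 can skip the merge.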

Page 69: A Deep Dive Into Understanding Apache Cassandra

Read

Memory

Disk

Bloom Filter

Key Cache

Partition Summary

Compression Offsets

Partition Index Data

Cache Hit

Cache Miss

= Off-heap

key_cache_size_in_mb > 0

index_interval = 128 (default)

Page 70: A Deep Dive Into Understanding Apache Cassandra

Compaction (Size-tiered)

Page 71: A Deep Dive Into Understanding Apache Cassandra

Compaction (Size-tiered)

min_compaction_threshold = 4

Memtable flush!
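Size-tiered compaction can be sketched as "group SSTables of similar size, and compact a group once it holds at least min_compaction_threshold tables". The bucketing rule below (a table joins a bucket if its size is within 0.5x to 1.5x of the bucket's average) is an assumption chosen for illustration, as are the function names and sizes.

```python
def buckets(sizes_mb, low: float = 0.5, high: float = 1.5) -> list:
    """Group SSTable sizes into buckets of roughly similar size."""
    result = []
    for size in sorted(sizes_mb):
        for bucket in result:
            average = sum(bucket) / len(bucket)
            if average * low <= size <= average * high:
                bucket.append(size)
                break
        else:
            result.append([size])  # no similar bucket yet, start a new one
    return result


def pick_compaction(sizes_mb, min_threshold: int = 4):
    """Return the first bucket with enough SSTables to compact, if any."""
    for bucket in buckets(sizes_mb):
        if len(bucket) >= min_threshold:
            return bucket
    return None


print(pick_compaction([10, 10, 10]))          # three flushes: nothing to do yet
print(pick_compaction([10, 11, 9, 10, 400]))  # fourth similar table triggers it
```

Each memtable flush adds one small SSTable, so the fourth flush is what tips a bucket over the threshold of 4.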

Page 72: A Deep Dive Into Understanding Apache Cassandra

Compaction (Size-tiered)

Page 73: A Deep Dive Into Understanding Apache Cassandra

Compaction (Leveled)

Memtable flush!

Page 74: A Deep Dive Into Understanding Apache Cassandra

Compaction (Leveled)

L0: 160 MB L1: 160 MB x 10

sstable_size_in_mb = 160

L2: 160 MB x 100

Page 75: A Deep Dive Into Understanding Apache Cassandra

Compaction (Leveled)

L0: 160 MB L1: 160 MB x 10 L2: 160 MB x 100

Page 76: A Deep Dive Into Understanding Apache Cassandra

Topics

• CAS (Paxos)
• Anti-entropy (Merkle trees)
• Gossip (Failure detection)

Page 77: A Deep Dive Into Understanding Apache Cassandra

Thanks