Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix

Upload: acunu

Post on 15-Jan-2015

DESCRIPTION

Nicolas' talk from Cassandra Europe on March 28th 2012.

TRANSCRIPT

Page 1: Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix

Cassandra storage internals

Nicolas Favre-Felix

Cassandra Europe 2012

What this talk covers

• What happens within a Cassandra node

• How Cassandra reads and writes data

• What compaction is and why we need it

• How counters are stored, modified, and read

Concepts

• Memtables

• SSTables

• Commit Log

• Key cache

• Row cache

• On heap, off-heap

• Compaction

• Bloom filters

• SSTable index

• Counters

Why is this important?

• Understand what goes on under the hood

• Understand the reasons for these choices

• Diagnose issues

• Tune Cassandra for performance

• Make your data model efficient

A word about hard drives

• Main driver behind Cassandra’s storage choices

• The last moving part

• Fast sequential I/O (150 MB/s)

• Slow random I/O (120-200 IOPS)
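
A rough calculation shows how far apart these two figures are. The sketch below assumes each random read fetches 4 KiB (an illustrative request size, not a number from the talk) and converts the IOPS range into throughput for comparison with the sequential rate.

```python
# Back-of-the-envelope comparison of sequential and random I/O on a spinning disk.
# The 4 KiB request size is an assumption for illustration; the other figures
# are the ones quoted on the slide.
SEQUENTIAL_MB_PER_S = 150
RANDOM_IOPS = (120, 200)
REQUEST_KIB = 4  # assumed size of one random read


def random_throughput_mb_per_s(iops: int, request_kib: int = REQUEST_KIB) -> float:
    """Throughput achieved when every request needs its own seek."""
    return iops * request_kib / 1024


low, high = (random_throughput_mb_per_s(i) for i in RANDOM_IOPS)
print(f"random: {low:.2f} to {high:.2f} MB/s, sequential: {SEQUENTIAL_MB_PER_S} MB/s")
```

Under that assumption, random access stays well below 1 MB/s, which is why the rest of the design avoids seeks wherever it can.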

What SSDs bring

• Fast sequential I/O

• Fast random I/O

• Higher cost

• Limited lifetime

• Performance degradation

Disk usage with B-trees

• Important data structure in relational databases

• In-place overwrites (random I/O)

• log_B(N) random accesses for reads and writes, where B is the branching factor of the tree

Disk usage with Cassandra

• Made for spinning disks

• Sequential writes, much less than 1 I/O per insert

• Several layers of cache

• Random reads, approximately 1 I/O per read

• Generally “write-optimised”

Writing to Cassandra

Let’s add a row with a few columns

[Diagram: a row, made of a row key followed by four columns.]

The Cassandra write path

[Diagram: new data is written to the Memtable (in the JVM) and to the Commit log (on disk).]

The Commit Log

• Each write is added to a log file

• Guarantees durability after a crash

• 1-second window during which data is still in RAM

• Sequential I/O

• A dedicated disk is recommended
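
The points above can be sketched as an append-only file with a periodic sync. This is a simplified illustration, not Cassandra's implementation: the class name `CommitLog`, the `sync_interval` parameter, and the tab-separated record format are all assumptions made for the example.

```python
import os
import tempfile
import time


class CommitLog:
    """Append-only log. `sync_interval` mimics the 1-second window (hypothetical name)."""

    def __init__(self, path: str, sync_interval: float = 1.0):
        self._file = open(path, "ab")
        self._sync_interval = sync_interval
        self._last_sync = time.monotonic()

    def append(self, key: str, column: str, value: str) -> None:
        record = f"{key}\t{column}\t{value}\n".encode("utf-8")
        self._file.write(record)  # sequential append, never a seek
        if time.monotonic() - self._last_sync >= self._sync_interval:
            self.sync()

    def sync(self) -> None:
        # Until this runs, recent writes may only exist in RAM.
        self._file.flush()
        os.fsync(self._file.fileno())
        self._last_sync = time.monotonic()

    def close(self) -> None:
        self.sync()
        self._file.close()


def replay(path: str) -> list:
    """Read every record back, as a node would after a crash."""
    with open(path, "rb") as f:
        return [tuple(line.decode("utf-8").rstrip("\n").split("\t")) for line in f]


path = os.path.join(tempfile.mkdtemp(), "commit.log")
log = CommitLog(path)
log.append("row1", "name", "alice")
log.append("row1", "city", "Paris")
log.close()
recovered = replay(path)
```

Because every write lands at the end of the file, the disk head never has to move, which is the reason a dedicated disk helps.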

Memtables

• In-memory Key/Value data structure

• Implemented with ConcurrentSkipListMap

• One per column family

• Very fast inserts

• Columns are merged in memory for the same key

• Flushed at a certain threshold, into an SSTable
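
A minimal sketch of these ideas follows. It uses a plain dictionary sorted at flush time rather than a skip list, and the names `Memtable`, `flush_threshold`, and `is_full` are hypothetical; the threshold here counts columns, which is only one possible choice.

```python
class Memtable:
    """In-memory rows, merged per key; a plain dict stands in for the skip list."""

    def __init__(self, flush_threshold: int = 3):
        self.rows = {}  # row key -> {column name: (timestamp, value)}
        self.flush_threshold = flush_threshold  # assumed: number of stored columns
        self.column_count = 0

    def insert(self, key, column, value, timestamp):
        columns = self.rows.setdefault(key, {})
        current = columns.get(column)
        if current is None:
            self.column_count += 1
        if current is None or timestamp >= current[0]:
            columns[column] = (timestamp, value)  # newest write wins, merged in memory

    def is_full(self) -> bool:
        return self.column_count >= self.flush_threshold

    def flush(self):
        """Return rows sorted by key with columns sorted by name, then start empty."""
        snapshot = [(key, sorted(cols.items())) for key, cols in sorted(self.rows.items())]
        self.rows, self.column_count = {}, 0
        return snapshot


memtable = Memtable(flush_threshold=3)
memtable.insert("bob", "age", 30, timestamp=1)
memtable.insert("alice", "city", "Paris", timestamp=2)
memtable.insert("bob", "age", 31, timestamp=3)  # overwrites in place, no new column
full_before = memtable.is_full()
memtable.insert("alice", "age", 25, timestamp=4)
full_after = memtable.is_full()
flushed = memtable.flush()
```

The flushed snapshot is already in key order, so it can be written to disk in one sequential pass.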

Dumping a Memtable on disk

[Diagram, step 1: a full Memtable in the JVM; the Commit log on disk.]

[Diagram, step 2: a new Memtable in the JVM; an SSTable now sits next to the Commit log on disk.]

The SSTable

• One file, written sequentially

• Columns are in order, grouped by row

• Immutable once written, no updates!

SSTables start piling up!

[Diagram: the Memtable in the JVM; on disk, the Commit log and a growing number of SSTables.]

SSTables

• Can’t keep all of them forever

• Need to reclaim disk space

• Reads could touch several SSTables

• Scans touch all of them

• In-memory data structures per SSTable

Compacting SSTables

Compaction

• Merge SSTables of similar size together

• Remove overwrites and deleted data (timestamps)

• Improve range query performance

• Major compaction creates a single SSTable

• I/O intensive operation
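
The merge step can be illustrated with a small sketch. It assumes each SSTable is a sorted list of `(row key, column, timestamp, value)` tuples and uses `None` as a deletion marker; both are conventions invented for this example. It also drops deleted cells immediately, which is a simplification.

```python
import heapq
from itertools import groupby

TOMBSTONE = None  # assumed marker for a deleted column


def compact(*sstables):
    """Merge sorted SSTables, keeping only the newest version of each cell."""
    # Inputs are already sorted by (row key, column), so one sequential pass suffices.
    merged = heapq.merge(*sstables, key=lambda cell: cell[:2])
    result = []
    for _, versions in groupby(merged, key=lambda cell: cell[:2]):
        newest = max(versions, key=lambda cell: cell[2])  # highest timestamp wins
        if newest[3] is not TOMBSTONE:
            result.append(newest)  # overwritten and deleted data is left behind
    return result


older = [("alice", "age", 1, 30), ("bob", "city", 1, "Paris")]
newer = [("alice", "age", 5, 31), ("bob", "city", 6, TOMBSTONE), ("carol", "age", 4, 22)]
compacted = compact(older, newer)
```

The output is itself sorted, so it can be written out as a new SSTable while the inputs are discarded.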

Recent improvements

• Pluggable compaction

• Different strategies, chosen per column family

• SSTable compression

• More efficient SSTable merges

Reading from Cassandra

• Reading all these SSTables would be very inefficient

• We have to read from memory as much as possible

• Otherwise we need to do 2 things efficiently:

• Find the right SSTable to read from

• Find where in that SSTable to read the data

First step for reads

• The Memtable!

• Read the most recent data

• Very fast, no need to touch the disk

[Diagram: the Memtable in the JVM, the Row cache off-heap (no GC), and the SSTable and Commit log on disk.]

Row cache

• Stores a whole row in memory

• Off-heap, not subject to Garbage Collection

• Size is configurable per column family

• Last resort before having to read from disk

Finding the right SSTable

[Diagram: the Memtable in the JVM; on disk, the Commit log and many SSTables.]

Bloom filter

• Saved with each SSTable

• Answers “contains(Key) :: boolean”

• Saved on disk but kept in memory

• Probabilistic data structure

• Configurable proportion of false positives

• No false negatives
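
To make the "no false negatives" property concrete, here is a small Bloom filter sketch. The sizes (`num_bits`, `num_hashes`) and the use of SHA-256 as the hash are assumptions chosen for readability, not what Cassandra uses.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; answers 'might this key be present?'"""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits      # more bits -> fewer false positives
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


bloom = BloomFilter()
stored_keys = [f"k-{n}" for n in range(50)]
for k in stored_keys:
    bloom.add(k)
```

A key that was added always sets all of its bits, so `might_contain` can never answer `False` for it; an absent key may occasionally find all of its bits set by other keys, which is the false-positive case.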

[Diagram: each SSTable on disk has its own Bloom filter; a read asks each filter "exists(key)?" and gets back true/false.]

Reading from an SSTable

• We need to know where in the file our data is saved

• Keys are sorted, why don’t we do a binary search?

• Keys are not all the same size

• Jumping around in a file is very slow

• log2(N) random I/Os, ~20 for 1 million keys
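
The "~20" figure follows directly from the logarithm. The sketch below checks it and, using the 120-200 IOPS range quoted earlier for spinning disks, estimates how long those seeks would take; the function name is made up for the example.

```python
import math


def binary_search_probes(num_keys: int) -> int:
    """Worst-case number of probes for a binary search over num_keys sorted keys."""
    return math.ceil(math.log2(num_keys))


probes = binary_search_probes(1_000_000)

# Each probe is a random I/O; at 120 to 200 IOPS a single lookup takes this long.
slowest_seconds = probes / 120
fastest_seconds = probes / 200
print(probes, round(fastest_seconds, 3), round(slowest_seconds, 3))
```

At a tenth of a second or more per lookup, a binary search directly on disk is far too slow, which motivates the index described next.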

Reading from an SSTable

Let’s index key ranges in the SSTable

Key      Position
k-128    12098
k-256    23445
k-384    43678

SSTable index

• Saved with each SSTable

• Stores key ranges and their offsets: [(Key, Offset)]

• Saved on disk but kept in memory

• Avoids searching for a key by scanning the file

• Configurable key interval (default: 128)
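
A sparse index of this kind can be sketched as follows. The in-memory `io.BytesIO` buffer stands in for the SSTable file, and the helper names (`write_sstable`, `lookup`) and the tab-separated row format are assumptions for illustration only.

```python
import bisect
import io

INDEX_INTERVAL = 128  # default key interval quoted on the slide


def write_sstable(rows):
    """Write sorted (key, value) rows sequentially; return the data and a sparse index."""
    data, index = io.BytesIO(), []
    for i, (key, value) in enumerate(sorted(rows)):
        if i % INDEX_INTERVAL == 0:
            index.append((key, data.tell()))  # sample one key per interval
        data.write(f"{key}\t{value}\n".encode("utf-8"))
    data.seek(0)
    return data, index


def lookup(data, index, key):
    """Seek to the nearest indexed key at or before `key`, then scan one interval."""
    i = bisect.bisect_right(index, (key, float("inf"))) - 1
    if i < 0:
        return None  # key sorts before everything in this SSTable
    data.seek(index[i][1])  # a single jump into the file
    for _ in range(INDEX_INTERVAL):
        line = data.readline()
        if not line:
            break
        k, v = line.decode("utf-8").rstrip("\n").split("\t")
        if k == key:
            return v
        if k > key:
            break  # keys are sorted, so the key is not here
    return None


rows = [(f"k-{n:05d}", f"v{n}") for n in range(1000)]
data, index = write_sstable(rows)
found = lookup(data, index, "k-00300")
missing = lookup(data, index, "k-99999")
```

Only the sampled keys live in memory, so the index stays small, at the cost of scanning up to one interval of the file per lookup.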

SSTable index

[Diagram: the Memtable (in the JVM), the SSTable index and Bloom filter, and the SSTable and Commit log (on disk).]

Sometimes not enough

• Storing key ranges is limited

• We can do better by storing the exact offset

• This saves approximately one I/O

The key cache

[Diagram: the Memtable (in the JVM), the SSTable index, Bloom filter and Key cache, and the SSTable and Commit log (on disk).]

Key cache

• Stores the exact location in the SSTable

• Stored on the JVM heap

• Avoids having to scan a whole index interval

• Size is configurable per column family
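
As an illustration, a key cache can be pictured as a bounded map from a key to its exact byte offset. The sketch below assumes least-recently-used eviction and a `(sstable id, row key)` cache key; neither detail comes from the talk, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class KeyCache:
    """Bounded map from (sstable id, row key) to the exact offset in that SSTable."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity       # stands in for the configurable size
        self._offsets = OrderedDict()  # least recently used entry first

    def get(self, sstable_id, key):
        entry = (sstable_id, key)
        if entry not in self._offsets:
            return None  # miss: fall back to the SSTable index and scan
        self._offsets.move_to_end(entry)
        return self._offsets[entry]

    def put(self, sstable_id, key, offset):
        entry = (sstable_id, key)
        self._offsets[entry] = offset
        self._offsets.move_to_end(entry)
        if len(self._offsets) > self.capacity:
            self._offsets.popitem(last=False)  # evict the least recently used entry


cache = KeyCache(capacity=2)
cache.put(1, "alice", 12098)
cache.put(1, "bob", 23445)
hit = cache.get(1, "alice")   # refreshes "alice"
cache.put(1, "carol", 43678)  # evicts "bob"
```

On a hit the reader can seek straight to the row, skipping the scan of an index interval.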

[Diagram: the read path, with steps numbered 1 to 6 across the Memtable (in the JVM), the off-heap Row cache (no GC), the Bloom filter, the Key cache, the SSTable index, and the SSTable and Commit log on disk.]

Distributed counters

• 64-bit signed integer, replicated in the cluster

• Atomic inc and dec by an arbitrary amount

• Counting with read-inc-write would be inefficient

• Stored differently from regular columns

Consider a cluster with 3 nodes, RF=3

Internal counter data

• List of increments received by the local node

• Summaries (Version, Sum) sent by the other nodes

• The total value is the sum of all counts

[Diagram: one node's counter data. Local increments: +5, +2, -3. Received from the other two nodes: (version: 3, count: 5) and (version: 5, count: 10).]

Incrementing a counter

• A coordinator node is chosen (the blue node in the diagram)

• Stores its increment locally

• Reads back the sum of its increments

• Forwards a summary to other replicas: (v.4, sum 5)

• Replicas update their records

[Diagram: the coordinator's local increments go from +5 +2 -3 to +5 +2 -3 +1; the replicas then record "Received from: version 4, count 5".]
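
The increment steps can be sketched in a few lines. This is a toy model that assumes the version is simply the number of increments a node has coordinated, which matches the (v.4, sum 5) example; the class `CounterReplica` and its methods are hypothetical names.

```python
class CounterReplica:
    """One node's view of a single counter."""

    def __init__(self, name: str):
        self.name = name
        self.local_increments = []  # increments this node coordinated
        self.version = 0            # assumed: bumped once per local increment
        self.remote = {}            # other node's name -> (version, sum)

    def increment(self, delta: int, replicas):
        """Act as coordinator: store locally, read back the sum, forward a summary."""
        self.local_increments.append(delta)
        self.version += 1
        summary = (self.version, sum(self.local_increments))  # the read on write
        for replica in replicas:
            replica.receive(self.name, summary)
        return summary

    def receive(self, sender: str, summary) -> None:
        current = self.remote.get(sender)
        if current is None or summary[0] > current[0]:
            self.remote[sender] = summary  # keep only the most recent version


a, b, c = CounterReplica("A"), CounterReplica("B"), CounterReplica("C")
for delta in (5, 2, -3, 1):
    last_summary = a.increment(delta, replicas=[b, c])
```

Replicas store one small summary per coordinator instead of every individual increment, so forwarding stays cheap however many increments a node has seen.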

Reading a counter

• Replicas return their counts and versions

• Including what they know about other nodes

• Only the most recent versions are kept

[Diagram: a node holding (version: 6, count: 12), and the two sets of values returned for the read.]

{ v. 3, count 5; v. 6, count 12; v. 2, count 8 }

{ v. 3, count 5; v. 5, count 10; v. 4, count 5 }

Counter value: 5 + 12 + 5 = 22
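
The arithmetic above can be reproduced with a short merge function. The node labels `n1` to `n3` are placeholders invented for the example, and the position of each entry in a response is assumed to identify the node it describes.

```python
def merge_counter(replies) -> int:
    """Keep the most recent (version, count) per node, then sum the counts."""
    latest = {}
    for reply in replies:
        for node, (version, count) in reply.items():
            if node not in latest or version > latest[node][0]:
                latest[node] = (version, count)
    return sum(count for _, count in latest.values())


# The two responses from the example, keyed by hypothetical node labels.
replies = [
    {"n1": (3, 5), "n2": (6, 12), "n3": (2, 8)},
    {"n1": (3, 5), "n2": (5, 10), "n3": (4, 5)},
]
total = merge_counter(replies)
```

For `n2` the first response is newer (version 6), for `n3` the second is (version 4), so the stale counts 10 and 8 are discarded before summing.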

Storage problems

Tuning

• Cassandra can’t really use large amounts of RAM

• Garbage Collection pauses stop everything

• Compaction has an impact on performance

• Reading from disk is slow

• These limitations restrict the size of each node

Recap

• Fast sequential writes

• ~1 I/O for uncached reads, 0 for cached

• Counter increments require a read on write; regular column writes don’t

• Know where your time is spent (monitor!)

• Tune accordingly

Questions?

http://www.flickr.com/photos/kubina/326628918/sizes/l/in/photostream/

http://www.flickr.com/photos/alwarrete/5651579563/sizes/o/in/photostream/

http://www.flickr.com/photos/pio1976/3330670980/sizes/o/in/photostream/

http://www.flickr.com/photos/lwr/100518736/sizes/l/in/photostream/

• In-kernel backend

• No Garbage Collection

• No need to plan heavy compactions

• Low and consistent latency

• Full versioning, snapshots

• No degradation with Big Data