Cassandra TK 2014 - Large Nodes
DESCRIPTION
A discussion of running Cassandra with a large data load per node.

TRANSCRIPT
![Page 1: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/1.jpg)
CASSANDRA TK 2014
LARGE NODES WITH CASSANDRA
Aaron Morton @aaronmorton
Co-Founder & Principal Consultant www.thelastpickle.com
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
![Page 2: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/2.jpg)
About The Last Pickle. Work with clients to deliver and improve
Apache Cassandra based solutions.
Apache Cassandra Committer, DataStax MVP, Hector Maintainer, Apache Usergrid
Committer. Based in New Zealand & USA.
![Page 3: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/3.jpg)
Large Node?
“Avoid storing more than 500GB per node.”
(Originally said about EC2 nodes.)
![Page 4: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/4.jpg)
Large Node?
“You may have issues if you have over 1 billion keys per node.”
![Page 5: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/5.jpg)
Before version 1.2 large nodes had operational and
performance concerns.
![Page 6: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/6.jpg)
After version 1.2 large nodes have fewer operational and
performance concerns.
![Page 7: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/7.jpg)
Issues Pre 1.2
Work Arounds Pre 1.2
Improvements 1.2 to 2.1
![Page 8: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/8.jpg)
Memory Management. Some in-memory structures grow with the number of rows and the size of the data.
![Page 9: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/9.jpg)
Bloom Filter. Stores a bitset used to determine, with a certain probability, whether a key exists in an SSTable.

Size depends on the number of rows and bloom_filter_fp_chance.
![Page 10: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/10.jpg)
Bloom Filter. Allocates pages of 4096 longs in a long[][] array.
![Page 11: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/11.jpg)
[Chart: Bloom Filter size in MB (0 to 1,200) vs. millions of rows (1, 10, 100, 1,000), for bloom_filter_fp_chance 0.01 and 0.10.]
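The chart's shape can be reproduced with the textbook Bloom filter sizing formula. This is a sketch, not Cassandra's exact allocator (which rounds up to pages of 4096 longs), and the function name is illustrative:

```python
import math

def bloom_filter_size_mb(num_rows: int, fp_chance: float) -> float:
    """Approximate Bloom filter size via m = -n * ln(p) / (ln 2)^2 bits.
    Cassandra's real allocation rounds up to pages of 4096 longs,
    so actual sizes are slightly larger."""
    bits = -num_rows * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8 / 1024 / 1024

# 1 billion rows at the default fp chance of 0.01:
print(round(bloom_filter_size_mb(1_000_000_000, 0.01)))  # 1143 (MB)
# Raising fp chance to 0.1 roughly halves it:
print(round(bloom_filter_size_mb(1_000_000_000, 0.1)))   # 571 (MB)
```

These figures line up with the chart's top-right points, which is why large nodes with a billion-plus keys felt the Bloom filters first.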
![Page 12: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/12.jpg)
Compression Metadata. Stores a long offset into the compressed -Data.db file for each chunk_length_kb (default 64) of uncompressed data.

Size depends on the uncompressed data size.
![Page 13: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/13.jpg)
Compression Metadata. Allocates pages of 4096 longs in a long[][] array.
![Page 14: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/14.jpg)
[Chart: Compression Metadata size in MB (0 to 1,400) vs. uncompressed size in GB (1, 10, 100, 1,000, 10,000), using the Snappy compressor.]
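The chart follows directly from one 8-byte offset per chunk; a minimal sketch (the function name is illustrative, and the real structure's page rounding is ignored):

```python
def compression_metadata_mb(uncompressed_gb: float,
                            chunk_length_kb: int = 64) -> float:
    """One 8-byte long offset per chunk_length_kb of uncompressed data."""
    chunks = uncompressed_gb * 1024 * 1024 / chunk_length_kb
    return chunks * 8 / 1024 / 1024

# 10 TB uncompressed at the default 64KB chunk:
print(round(compression_metadata_mb(10_000)))       # 1250 (MB)
# Doubling chunk_length_kb halves the metadata:
print(round(compression_metadata_mb(10_000, 128)))  # 625 (MB)
```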
![Page 15: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/15.jpg)
Index Samples. Stores an offset into the -Index.db file for every index_interval (default 128) keys.

Size depends on the number of rows and the size of the keys.
![Page 16: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/16.jpg)
Index Samples. Allocates a long[] for offsets and a byte[][] for row keys.

(In version 1.2 these are on-heap structures.)
![Page 17: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/17.jpg)
[Chart: Index Sample total size in MB (0 to 300) vs. millions of rows (1, 10, 100, 1,000), split into position offsets and keys (25 bytes long).]
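The two chart series can be estimated as one long offset plus one sampled key per index_interval rows. A sketch under the chart's 25-byte-key assumption (function name illustrative, page rounding ignored):

```python
def index_sample_mb(num_rows: int, key_bytes: int = 25,
                    index_interval: int = 128):
    """Rough index sample footprint: one 8-byte offset and one sampled
    key for every index_interval rows."""
    samples = num_rows // index_interval
    offsets_mb = samples * 8 / 1024 / 1024         # long[] of positions
    keys_mb = samples * key_bytes / 1024 / 1024    # byte[][] of row keys
    return offsets_mb, keys_mb

offsets, keys = index_sample_mb(1_000_000_000)
print(round(offsets), round(keys))  # 60 186 (MB, for 1 billion rows)
```

The keys dominate, which is why raising index_interval (covered in the work-arounds) pays off on large nodes.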
![Page 18: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/18.jpg)
Memory Management. Larger heaps (above 8GB) take longer to GC.

A large working set results in frequent, prolonged GC.
![Page 19: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/19.jpg)
Bootstrap. The joining node requests data from one replica of each token range it will own.

Sending is throttled by stream_throughput_outbound_megabits_per_sec (default 200 megabits, roughly 25 MB/sec).
![Page 20: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/20.jpg)
Bootstrap. With RF 3, only three nodes will send data to a bootstrapping node.

Maximum send rate is 75 MB/sec (3 x 25 MB/sec).
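A back-of-the-envelope bootstrap time under that cap (a sketch; real transfers also contend with disk and network, and the function name is illustrative):

```python
def bootstrap_hours(data_gb: float, senders: int = 3,
                    throttle_mb_per_sec: float = 25) -> float:
    """Time to stream data_gb onto a joining node when `senders`
    replicas each send at the per-node stream throttle."""
    aggregate_mb_per_sec = senders * throttle_mb_per_sec
    return data_gb * 1024 / aggregate_mb_per_sec / 3600

# 1 TB at the 75 MB/sec cap:
print(round(bootstrap_hours(1000), 1))  # 3.8 (hours)
```

More senders shrink this linearly, which is the improvement virtual nodes deliver later in the deck.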
![Page 21: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/21.jpg)
Moving Nodes. Copy data from the existing node to the new node.

At 50 MB/sec, transferring 100GB takes about 33 minutes.
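The slide's arithmetic generalizes to any node size (a sketch; decimal units, 1 GB = 1000 MB, which is what the slide's figure implies):

```python
def transfer_minutes(data_gb: float, rate_mb_per_sec: float) -> float:
    """Wall-clock time to copy data_gb at a sustained rate."""
    return data_gb * 1000 / rate_mb_per_sec / 60

print(round(transfer_minutes(100, 50)))   # 33 (minutes), the slide's figure
print(round(transfer_minutes(1000, 50)))  # 333 (minutes) for a 1 TB node
```

At multi-TB node sizes the copy stretches into many hours, which is why large nodes make moves painful.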
![Page 22: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/22.jpg)
Disk Management. Need a multi-TB volume, or use multiple volumes.
![Page 23: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/23.jpg)
Disk Management with RAID-0. Single disk failure results in total node failure.
![Page 24: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/24.jpg)
Disk Management with RAID-10. Requires double the raw capacity.
![Page 25: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/25.jpg)
Disk Management with Multiple Volumes. Specified via data_file_directories.

Write load is not distributed.

A single disk failure will shut down the node.
![Page 26: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/26.jpg)
Repair. Compare data between nodes and exchange the differences.
![Page 27: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/27.jpg)
Comparing Data for Repair. Calculate a Merkle Tree hash by reading all rows in a Table (Validation Compaction).

Single comparator, throttled by compaction_throughput_mb_per_sec (default 16).
![Page 28: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/28.jpg)
Comparing Data for Repair. Time taken grows as the size of the data per
node grows.
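How fast it grows is easy to bound from the throttle alone (a sketch; the function name is illustrative and assumes the validation reads every byte once at the default throttle):

```python
def validation_hours(data_gb: float,
                     compaction_throughput_mb_per_sec: float = 16) -> float:
    """Lower bound on the time to hash a node's data for repair."""
    return data_gb * 1024 / compaction_throughput_mb_per_sec / 3600

print(round(validation_hours(500), 1))   # 8.9 (hours) for 500 GB
print(round(validation_hours(2000), 1))  # 35.6 (hours) for 2 TB
```

At the 500GB-per-node guideline a repair's hashing phase already takes most of a working day; multi-TB nodes push it past a day.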
![Page 29: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/29.jpg)
Exchanging Data for Repair. Ranges of rows with differences are streamed.

Sending is throttled by stream_throughput_outbound_megabits_per_sec (default 200 megabits, roughly 25 MB/sec).
![Page 30: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/30.jpg)
Compaction. Requires free space to write new SSTables.
![Page 31: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/31.jpg)
SizeTieredCompactionStrategy. Groups SSTables by size; assumes no reduction in size.

In theory requires 50% free space; in practice it can work beyond 50%, though this is not recommended.
![Page 32: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/32.jpg)
LeveledCompactionStrategy. Groups SSTables by “level” and groups row fragments per level.

Requires approximately 25% free space.
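The two free-space rules of thumb translate into usable capacity like this (a simplification of the slides' 50%/25% figures, not an exact guarantee; the function name is illustrative):

```python
def usable_capacity_gb(raw_gb: float, strategy: str = "STCS") -> float:
    """Rule-of-thumb usable data per node: SizeTiered wants ~50% free
    space for compaction headroom, Leveled ~25%."""
    headroom = {"STCS": 0.50, "LCS": 0.25}[strategy]
    return raw_gb * (1 - headroom)

print(usable_capacity_gb(2000, "STCS"))  # 1000.0 GB usable on a 2 TB volume
print(usable_capacity_gb(2000, "LCS"))   # 1500.0 GB
```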
![Page 33: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/33.jpg)
Issues Pre 1.2
Work Arounds Pre 1.2
Improvements 1.2 to 2.1
![Page 34: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/34.jpg)
Memory Management Work Arounds. Reduce Bloom Filter size by increasing bloom_filter_fp_chance from 0.01 to 0.1.

May increase read latency.
![Page 35: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/35.jpg)
Memory Management Work Arounds. Reduce Compression Metadata size by increasing chunk_length_kb.

May increase read latency.
![Page 36: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/36.jpg)
Memory Management Work Arounds. Reduce Index Samples size by increasing index_interval to 512.

May increase read latency.
![Page 37: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/37.jpg)
Memory Management Work Arounds. When necessary, use a 12GB MAX_HEAP_SIZE.

Keep HEAP_NEWSIZE “reasonable”, e.g. less than 1200MB.
![Page 38: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/38.jpg)
Bootstrap Work Arounds. Increase streaming throughput via
nodetool setstreamthroughput whenever possible.
![Page 39: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/39.jpg)
Moving Node Work Arounds. Copy a nodetool snapshot while the original node is operational.

Copy only the delta once the original node is stopped.
![Page 40: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/40.jpg)
Disk Management Work Arounds. Use RAID-0 and over-provision nodes, anticipating failure.

Or use RAID-10 and accept the additional cost.
![Page 41: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/41.jpg)
Repair Work Arounds. Only use repair if data is deleted; rely on Consistency Level for distribution.

Run frequent, small repairs using token ranges.
![Page 42: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/42.jpg)
Compaction Work Arounds. Over-provision disk capacity when using SizeTieredCompactionStrategy.

Reduce min_compaction_threshold (default 4) and max_compaction_threshold (default 32) to reduce the number of SSTables per compaction.
![Page 43: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/43.jpg)
Compaction Work Arounds. Use LeveledCompactionStrategy
where appropriate.
![Page 44: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/44.jpg)
Issues Pre 1.2
Work Arounds Pre 1.2
Improvements 1.2 to 2.1
![Page 45: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/45.jpg)
Memory Management Improvements. Version 1.2 moved Bloom Filters and Compression Metadata off the JVM heap into native memory.

Version 2.0 moved Index Samples off the JVM heap.
![Page 46: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/46.jpg)
Bootstrap Improvements. Virtual Nodes increase the number of token ranges per node from 1 to 256.

A bootstrapping node can request data from up to 256 different nodes.
![Page 47: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/47.jpg)
Disk Layout Improvements. “JBOD” support distributes concurrent
writes to multiple data_file_directories.
![Page 48: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/48.jpg)
Disk Layout Improvements. disk_failure_policy adds support for handling disk failure:

ignore
stop
best_effort
![Page 49: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/49.jpg)
Repair Improvements. “Avoid repairing already-repaired data by default” (CASSANDRA-5351).

Scheduled for 2.1.
![Page 50: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/50.jpg)
Compaction Improvements. “Avoid allocating overly large bloom filters” (CASSANDRA-5906).

Included in 2.1.
![Page 51: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/51.jpg)
Thanks.
![Page 52: Cassandra TK 2014 - Large Nodes](https://reader034.vdocuments.mx/reader034/viewer/2022042715/557d80e6d8b42a58788b4d0d/html5/thumbnails/52.jpg)
Aaron Morton @aaronmorton
www.thelastpickle.com