Cassandra Basics 2.0
01 Cassandra Architecture
02 Cassandra Cluster
03 CQL Column Family
04 Cassandra Write Path
05 Cassandra Read Path
06 C* Data Model
What is Cassandra?
• Open-source database management system (DBMS)
• Several key features of Cassandra differentiate it from other similar systems
Consistent Hashing
Consistent hashing distributes data across a cluster in a way that minimizes reorganization when nodes are added or removed. Data is partitioned by the hash of the partition key.
For example, given the following data:

name    age  car     gender
jim     36   camaro  M
carol   37   bmw     F
johnny  12           M
suzy    10           F

Cassandra assigns a hash value to each partition key:

Partition key  Murmur3 hash value
jim            -2245462676723223822
carol          7723358927203680754
johnny         -6723372854036780875
suzy           1168604627387940318
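The lookup from hash value to node can be sketched in a few lines of Python. This is only an illustration: the four nodes A to D and their token boundaries are assumptions made up for the example, and the hash values are copied from the slide rather than computed.

```python
from bisect import bisect_left

# Murmur3 hash values taken from the example above (not computed here).
TOKENS = {
    "jim": -2245462676723223822,
    "carol": 7723358927203680754,
    "johnny": -6723372854036780875,
    "suzy": 1168604627387940318,
}

# Hypothetical 4-node ring: each node owns the range that ENDS at its token.
RING = [
    (-4611686018427387904, "A"),
    (0, "B"),
    (4611686018427387904, "C"),
    (9223372036854775807, "D"),
]


def owner(token, ring=RING):
    """Return the node whose range contains the token, wrapping around the ring."""
    ends = [end for end, _ in ring]
    index = bisect_left(ends, token) % len(ring)
    return ring[index][1]


placement = {name: owner(tok) for name, tok in TOKENS.items()}
```

With these assumed boundaries, each of the four rows lands on a different node.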
What are row, row key, column key and column value?
A row is the smallest unit that stores related data in Cassandra.
• Rows – individual rows constitute a column family
• Row key – uniquely identifies a row in a column family
• Row – stores pairs of column keys and column values
• Column key – uniquely identifies a column value in a row
• Column value – stores one value or a collection of values
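One way to picture these terms is as nested maps. The sketch below is a mental model only, not Cassandra's storage format; it reuses the name/age/car/gender data from the earlier example, and the variable name `column_family` is chosen just for illustration.

```python
# Column family: row key -> row; row: column key -> column value.
column_family = {
    "jim": {"age": 36, "car": "camaro", "gender": "M"},
    "carol": {"age": 37, "car": "bmw", "gender": "F"},
    "johnny": {"age": 12, "gender": "M"},  # no "car" column stored for this row
    "suzy": {"age": 10, "gender": "F"},
}

# The row key selects a row, the column key selects a value inside it.
jims_car = column_family["jim"]["car"]

# Rows in the same column family do not need the same set of columns.
rows_without_car = sorted(k for k, row in column_family.items() if "car" not in row)
```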
How does Cassandra write so fast?
Cassandra is a log-structured storage engine
• Data is sequentially appended, not placed in pre-set locations
What are the key components of the write path?
Each node implements four key components to handle its writes:
• Memtables – in-memory tables corresponding to CQL tables, with indexes
• CommitLog – append-only log, replayed to restore a downed node's Memtables
• SSTables – Memtable snapshots periodically flushed to disk, clearing heap
• Compaction – periodic process to merge and streamline SSTables
When a node receives a write request:
• The record is appended to the CommitLog, and
• The record is appended to the Memtable for this record's target CQL table
• Periodically, Memtables flush to SSTables, clearing JVM heap and CommitLog
• Periodically, Compaction runs to merge and streamline SSTables
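The write path above can be mimicked with a toy model. Everything here is a simplification for illustration: `ToyNode` and `flush_threshold` are hypothetical names, the flush is triggered by a row count instead of memory pressure, and "newer SSTable wins" stands in for the timestamp comparison a real compaction performs.

```python
class ToyNode:
    """Toy model of one node's write path; not Cassandra's implementation."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only log of writes not yet flushed
        self.memtable = {}     # in-memory table: row key -> columns
        self.sstables = []     # flushed, read-only tables, oldest first
        self.flush_threshold = flush_threshold

    def write(self, row_key, columns):
        # 1. Append to the commit log, 2. apply to the memtable.
        self.commit_log.append((row_key, dict(columns)))
        self.memtable.setdefault(row_key, {}).update(columns)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Snapshot the memtable as a sorted SSTable, then clear memory and log.
        if not self.memtable:
            return
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()

    def compact(self):
        # Merge all SSTables into one; later tables overwrite earlier ones.
        merged = {}
        for table in self.sstables:
            for row_key, columns in table.items():
                merged.setdefault(row_key, {}).update(columns)
        self.sstables = [dict(sorted(merged.items()))]


node = ToyNode(flush_threshold=2)
node.write("jim", {"age": 36})
node.write("carol", {"age": 37})      # second row triggers a flush
node.write("jim", {"car": "camaro"})
node.write("suzy", {"age": 10})       # triggers a second flush
node.compact()
```

Note that a write never touches existing data on disk: it only appends, and the merging work is deferred to compaction.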
Partitioning
• Nodes are logically arranged in a ring topology.
• The hashed value of the key associated with a data partition is used to assign it to a node in the ring.
• The hash space wraps around after its maximum value, which closes the ring.
• Lightly loaded nodes move position on the ring to relieve heavily loaded nodes.
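To see why a ring keeps reorganization small, the sketch below adds a fifth node to a four-node ring and checks which keys change owner. It rests on assumptions: MD5 from the standard library stands in for Cassandra's Murmur3 partitioner, and the node names, token positions, and `user…` keys are invented for the example.

```python
import hashlib
from bisect import bisect_left


def token(key):
    """Stand-in hash (MD5, first 8 bytes as a signed 64-bit int); not Murmur3."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big", signed=True)


def owner(tok, ring):
    """ring maps range-end token -> node name; lookups wrap around the ring."""
    ends = sorted(ring)
    return ring[ends[bisect_left(ends, tok) % len(ends)]]


# Hypothetical ring of four nodes, each owning the range ending at its token.
ring = {
    -4611686018427387904: "A",
    0: "B",
    4611686018427387904: "C",
    9223372036854775807: "D",
}
keys = [f"user{i}" for i in range(1000)]
before = {k: owner(token(k), ring) for k in keys}

# Add node E in the middle of C's range; only part of C's data should move.
ring_after = dict(ring)
ring_after[2305843009213693952] = "E"
after = {k: owner(token(k), ring_after) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` goes from C to E; keys owned by A, B, and D stay where they were.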
CAP Theorem
• Consistency – All the servers in the system have the same data, so anyone using the system gets the same copy regardless of which server answers their request.
• Availability – The system always responds to a request (even if the response is not the latest data, is not consistent across the system, or is just a message saying the system isn't working).
• Partition Tolerance – The system continues to operate as a whole even if individual servers fail or can't be reached.
Cassandra Architecture Overview
○ Cassandra was designed with the understanding that system/hardware failures can and do occur
○ Peer-to-peer, distributed system
○ All nodes are the same
○ Data partitioned among all nodes in the cluster
○ Custom data replication to ensure fault tolerance
○ Read/Write-anywhere design
○ Google BigTable - data model
    ○ Column Families
    ○ Memtables
    ○ SSTables
○ Amazon Dynamo - distributed systems technologies
    ○ Consistent hashing
    ○ Partitioning
    ○ Replication
    ○ One-hop routing
Transparent Elasticity
Nodes can be added to and removed from a Cassandra cluster online, with no downtime.
[Figure: a ring of 6 nodes and a ring of 12 nodes]
Transparent Scalability
Adding Cassandra nodes increases performance linearly, along with the ability to manage terabytes to petabytes of data.
[Figure: a ring of 6 nodes with performance throughput = N, and a ring of 12 nodes with performance throughput = N x 2]
Multi-Geography/Zone Aware
Cassandra allows a single logical database to span 1 to N geographically dispersed datacenters. It also supports hybrid on-premises/cloud deployments.
Data Redundancy
Cassandra allows for customizable data redundancy so that data is protected against node loss. It also supports rack awareness (data can be replicated between different racks to guard against machine/rack failures).
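Note: the original Cassandra design described at Facebook used ZooKeeper to elect a leader, which told nodes the ranges they were replicas for. Apache Cassandra does not depend on ZooKeeper; nodes learn range ownership from each other through gossip.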
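Rack-aware replica placement can be sketched as a walk around the ring that prefers nodes on racks not yet used. This is a simplified illustration of the idea, not Cassandra's actual replication strategy code; the function name `place_replicas`, the node names, and the rack layout are all assumptions.

```python
def place_replicas(ring, start, replication_factor):
    """Pick replica nodes walking clockwise from index `start`.

    ring is a list of (node, rack) pairs in ring order. The first pass takes
    at most one node per rack; a second pass fills any remaining slots.
    """
    n = len(ring)
    order = [ring[(start + i) % n] for i in range(n)]
    chosen, used_racks = [], set()

    for node, rack in order:            # pass 1: spread across racks
        if len(chosen) == replication_factor:
            break
        if rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)

    for node, _rack in order:           # pass 2: reuse racks if needed
        if len(chosen) == replication_factor:
            break
        if node not in chosen:
            chosen.append(node)

    return chosen


# Hypothetical 6-node ring spread over 3 racks.
ring = [("n1", "r1"), ("n2", "r1"), ("n3", "r2"),
        ("n4", "r2"), ("n5", "r3"), ("n6", "r3")]
replicas = place_replicas(ring, start=0, replication_factor=3)
```

With three replicas on three different racks, losing any single rack still leaves two copies of the data.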
Security in Cassandra
• Internal Authentication
Manages login IDs and passwords inside the database.
• Object Permission Management
Controls who has access to what and who can do what in the database.
Uses familiar GRANT/REVOKE from relational systems.
• Client to Node Encryption
Protects data in flight to and from a database.