Introduction to Apache Cassandra
Apache Cassandra
Harshit Daga, Software Consultant, Knoldus Software LLP
Agenda
What is Cassandra
Gossip communication protocol
Cassandra- Data Model
Cassandra- Architecture
Reading from/Writing to a node
Data consistency
Cassandra
Cassandra is a massively scalable, schemaless database.
Open source database, licensed under the Apache License.
Originally developed by Facebook for inbox search.
Data model is based on Google's Bigtable.
Distributed design is based on Amazon's Dynamo.
Heavily promoted by DataStax.
Gossip Communication Protocol
Peer to peer communication protocol.
Nodes are arranged in a ring.
Data is replicated to multiple nodes.
Nodes periodically exchange the state information they hold about other nodes.
Nodes also exchange state information about themselves.
Each message carries a version, so newer information replaces older information.
No master-slave concept, and hence no single point of failure.
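The versioned exchange described above can be sketched in a few lines of Python. This is a toy simulation, not Cassandra's actual gossip implementation: the `Node` class, the `(version, status)` tuple, and `gossip_round` are hypothetical names chosen for illustration.

```python
import random


class Node:
    """A toy cluster member (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.version = 0
        # view: node name -> (version, status) for every node heard about so far
        self.view = {name: (0, "UP")}

    def heartbeat(self):
        # Bump our own version so peers can tell this state is newer.
        self.version += 1
        self.view[self.name] = (self.version, "UP")

    def merge(self, other_view):
        # Keep an entry only if it is new to us or carries a higher version.
        for name, (version, status) in other_view.items():
            if name not in self.view or version > self.view[name][0]:
                self.view[name] = (version, status)


def gossip_round(nodes, rng):
    # Every node picks one random peer and the two swap what they know.
    for node in nodes:
        node.heartbeat()
        peer = rng.choice([n for n in nodes if n is not node])
        snapshot = dict(node.view)
        node.merge(peer.view)
        peer.merge(snapshot)


def run_gossip(node_count=5, rounds=10, seed=1):
    rng = random.Random(seed)
    nodes = [Node(f"node{i}") for i in range(node_count)]
    for _ in range(rounds):
        gossip_round(nodes, rng)
    return nodes
```

There is no coordinator in this loop: every node runs the same code, which is the peer-to-peer property the slide refers to.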
Cassandra- Data Model
A column is stored as a key/value pair.
A collection of columns makes a row.
A column family is then a collection of rows.
In an RDBMS, every column of a row must hold a value or NULL; this is not the case in Cassandra.
Cassandra- Data Model
Consider the following example:
Now insert a new row:
The above insertion would not fail.
Cassandra- Data Model
This means that data is stored as a multi-dimensional sparse array.
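A minimal way to picture the sparse layout is a dictionary of dictionaries. The `users` column family, the row keys, and the column names below are invented for illustration and are not the example from the original slides.

```python
# Column family: row key -> {column name: value}. All names are hypothetical.
users = {}


def insert(column_family, row_key, **columns):
    # A row stores only the columns it is given; nothing is padded with NULL.
    column_family.setdefault(row_key, {}).update(columns)


def get(column_family, row_key, column):
    # A missing column is simply absent, so the lookup returns None.
    return column_family.get(row_key, {}).get(column)


insert(users, "u1", name="Alice", email="alice@example.com")
insert(users, "u2", name="Bob")  # no email column at all for this row
```

Rows `u1` and `u2` live in the same column family even though they carry different sets of columns.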
Cassandra- Architecture
A ring has several nodes.
Each node is assigned a partition token value.
Data processing is based on the Partition Key.
When a client makes a request to a node, it becomes the coordinator for that request.
The coordinator determines which node in the ring should process that request.
Cassandra- Architecture
Virtual Nodes (Vnodes)
Vnodes let each node own many small partition token ranges instead of one large range.
Tokens are automatically calculated and assigned to each node.
Cluster re-balancing is done automatically.
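As a rough sketch of the idea, the snippet below gives every node several random tokens. The function name, the `num_tokens` default of 8, and the fixed seed are assumptions made for illustration; the token range matches the signed 64-bit range used by Cassandra's Murmur3 partitioner.

```python
import random


def assign_vnode_tokens(node_names, num_tokens=8, seed=42):
    """Give each node `num_tokens` random tokens (illustrative sketch only)."""
    rng = random.Random(seed)
    ring = {}  # token -> owning node
    for name in node_names:
        assigned = 0
        while assigned < num_tokens:
            token = rng.randrange(-2**63, 2**63)
            if token not in ring:  # skip the (very unlikely) collision
                ring[token] = name
                assigned += 1
    # Return the ring ordered by token, as it would be walked clockwise.
    return dict(sorted(ring.items()))
```

Because each node holds many small ranges scattered around the ring, adding or removing a node moves small slices of data to or from many peers rather than one large slice from a single neighbour.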
Cassandra- Architecture
Which node gets what data is based on the partition key.
Cassandra computes a hash value (token) for each partition key.
The data is then placed on a node according to that hash value.
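The lookup from partition key to node can be sketched as follows. Note the assumptions: MD5 from Python's standard library stands in for Cassandra's default Murmur3 hash, each node has a single token, and `token_for` and `TokenRing` are hypothetical names.

```python
import bisect
import hashlib


def token_for(partition_key):
    # Stand-in hash: MD5 is used only because it ships with Python.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # 0 .. 2**64 - 1


class TokenRing:
    def __init__(self, node_tokens):
        # node_tokens: node name -> token (one token per node in this sketch)
        self._entries = sorted((token, name) for name, token in node_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    def owner(self, partition_key):
        # The owner is the first node whose token is >= the key's token,
        # wrapping around to the first node at the end of the ring.
        token = token_for(partition_key)
        index = bisect.bisect_left(self._tokens, token)
        if index == len(self._tokens):
            index = 0
        return self._entries[index][1]
```

For example, `TokenRing({"n1": 2**62, "n2": 2**63, "n3": 2**64 - 1}).owner("user:42")` returns the name of whichever node owns the range that the key's token falls into.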
Cassandra- Architecture
How a write request gets fulfilled:
Data Replication
Simple Strategy: used for a cluster in a single data center.
Network Topology Strategy: used for a cluster that spans multiple data centers.
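The two strategies differ in how they pick replicas while walking the ring clockwise. The functions below are a simplified sketch: the names are hypothetical, nodes are given in ring order, and rack awareness (which the real Network Topology Strategy also considers) is ignored.

```python
def simple_strategy_replicas(ring_nodes, primary_index, replication_factor):
    """Take the primary node and the next nodes clockwise, ignoring topology."""
    count = min(replication_factor, len(ring_nodes))
    return [ring_nodes[(primary_index + i) % len(ring_nodes)] for i in range(count)]


def network_topology_replicas(ring_nodes, datacenter_of, primary_index, replicas_per_dc):
    """Walk clockwise, taking nodes until each data center has its quota."""
    chosen = []
    remaining = dict(replicas_per_dc)  # data center -> replicas still needed
    for step in range(len(ring_nodes)):
        node = ring_nodes[(primary_index + step) % len(ring_nodes)]
        dc = datacenter_of[node]
        if remaining.get(dc, 0) > 0:
            chosen.append(node)
            remaining[dc] -= 1
    return chosen
```

With four nodes `["n1", "n2", "n3", "n4"]`, a primary at index 3 and a replication factor of 3, the simple strategy selects `n4`, `n1` and `n2`.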
Writing data in a Node
Write an entry to the commit log.
Write the data to the memtable.
When the memtable is full, flush its data to disk as an SSTable.
SSTables are immutable data structures.
Writes also support a TTL (time to live).
Because a write is only a sequential log append plus an in-memory update, Cassandra is very fast at write operations.
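The write steps above can be sketched with in-memory structures. This is a simplification under stated assumptions: `StorageNode` and `memtable_limit` are hypothetical names, "disk" is a Python list, and the memtable is considered full after a fixed number of keys rather than a size in bytes.

```python
class StorageNode:
    """Toy write path: commit log, then memtable, then flush to an SSTable."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of writes, used for recovery
        self.memtable = {}     # in-memory map: key -> (value, timestamp)
        self.sstables = []     # flushed, immutable tables (stand-in for disk)
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        # 1. Append to the commit log so the write survives a crash.
        self.commit_log.append((key, value, timestamp))
        # 2. Apply the write to the memtable.
        self.memtable[key] = (value, timestamp)
        # 3. Flush once the memtable is "full".
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A tuple sorted by key models an immutable, sorted SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed data no longer needs the log for recovery in this sketch.
        self.commit_log.clear()
```

No existing SSTable is ever modified here; each flush only adds a new one, which is why writes avoid random disk I/O.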
Reading data from a Node
First, check the memtable.
Then check the SSTables, using each SSTable's Bloom filter to skip those that cannot contain the requested key.
The data that is found is merged and sent as the response.
Cassandra may write many versions of the same row; the latest one is identified by its write timestamp.
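A sketch of this read path, including a small Bloom filter and a newest-timestamp-wins merge, is shown below. The class and function names, the filter size, and the use of SHA-256 are illustrative assumptions rather than Cassandra internals.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=256, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions by salting the key with a seed.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        return all(self.bits[position] for position in self._positions(key))


class SSTable:
    def __init__(self, rows):
        # rows: key -> (value, timestamp); treated as read-only after creation
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read(key, memtable, sstables):
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        # The Bloom filter lets us skip SSTables that cannot hold the key.
        if sstable.bloom.might_contain(key) and key in sstable.rows:
            candidates.append(sstable.rows[key])
    if not candidates:
        return None
    # The version with the highest timestamp is the latest one.
    value, _ = max(candidates, key=lambda candidate: candidate[1])
    return value
```

The Bloom filter only ever saves work: a false positive costs one unnecessary lookup, while a negative answer is always safe to trust.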
Update/Delete data from Node
Data is not immediately deleted.
It is first marked as deleted/updated in the memtable.
The marker written for a delete is called a tombstone.
Compaction runs periodically, based on the configured settings.
During each run, it merges SSTables into new ones, keeping the latest version of each record, dropping expired tombstoned data, and discarding the old SSTables.
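The merge step can be sketched as below. It assumes each SSTable is a plain dictionary of `key -> (value, timestamp)`, uses a sentinel object as the tombstone, and borrows the name `gc_grace_seconds` from Cassandra's table option for the period a tombstone must be kept; the `compact` function itself is hypothetical.

```python
TOMBSTONE = object()  # sentinel marking a deleted value


def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables, keep the newest version per key, purge old tombstones."""
    merged = {}
    for sstable in sstables:
        for key, (value, timestamp) in sstable.items():
            # Newest timestamp wins, whether it is a value or a tombstone.
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    # Drop tombstones that have outlived the grace period.
    return {
        key: (value, timestamp)
        for key, (value, timestamp) in merged.items()
        if not (value is TOMBSTONE and now - timestamp >= gc_grace_seconds)
    }
```

The result is a single new table; the input SSTables are left untouched and can be discarded afterwards, consistent with SSTables being immutable.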
Data Consistency
Data is not necessarily on every node all the time.
To maintain consistency, a configured number of replicas must respond to each request:
ONE
QUORUM
ALL
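The consistency level has a major impact on performance.
For strong consistency: R + W > N, where R is the number of replicas that must answer a read, W the number that must acknowledge a write, and N the replication factor.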
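The rule is easy to check numerically. The helper names below are hypothetical; the quorum size uses the usual majority formula, floor(N / 2) + 1.

```python
def quorum(replication_factor):
    # A majority of the replicas.
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    # R + W > N guarantees the read set and the write set overlap in at
    # least one replica, so a read always sees the latest acknowledged write.
    return read_replicas + write_replicas > replication_factor


N = 3
levels = {"ONE": 1, "QUORUM": quorum(N), "ALL": N}
```

With N = 3, reading and writing at QUORUM gives 2 + 2 > 3, which is strongly consistent, while reading and writing at ONE gives 1 + 1 = 2, which is not.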
Thank You !!