cassandra basics 2.0

Introduction to Cassandra

Asis Mohanty

Source: DataStax, Cassandra

01 ARCHITECTURE: Cassandra Architecture
02 C* CLUSTER: Cassandra Cluster
03 CQL COLUMN FAMILY: CQL Column Family
04 WRITE PATH: Cassandra Write Path
05 READ PATH: Cassandra Read Path
06 DATA MODEL: C* Data Model

What is Cassandra?

• Open-source database management system (DBMS)

• Several key features of Cassandra differentiate it from other similar systems

What is Cassandra?

Architected with Scale in Mind

What is a C* Cluster?

Peer To Peer

Data Centers

Tunable Consistency

Continuous Availability

Consistent Hashing

Consistent hashing distributes data across a cluster in a way that minimizes reorganization when nodes are added or removed. Cassandra partitions data based on the hash of the partition key.

For example, given these rows:

name     age   car      gender
jim      36    camaro   M
carol    37    bmw      F
johnny   12             M
suzy     10             F

Cassandra assigns a hash value to each partition key:

Partition key   Murmur3 hash value
jim             -2245462676723223822
carol           7723358927203680754
johnny          -6723372854036780875
suzy            1168604627387940318
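The claim that consistent hashing minimizes reorganization can be checked with a small sketch. This is an illustration under stated assumptions, not Cassandra's implementation: MD5 stands in for Murmur3 (which is not in the Python standard library), each node gets a single token derived from its name, and the names `token`, `Ring`, and `node1` to `node4` are hypothetical.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a signed 64-bit token. MD5 is a stand-in here, not Murmur3."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    """Toy token ring: one token per node, derived from the node name (assumption)."""

    def __init__(self, nodes):
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, key: str) -> str:
        # First node whose token is >= the key's token; wrap to the first node.
        i = bisect.bisect_left(self._tokens, token(key))
        return self._entries[i % len(self._entries)][1]


keys = [f"key-{i}" for i in range(1000)]
before = Ring(["node1", "node2", "node3"])
after = Ring(["node1", "node2", "node3", "node4"])

# Keys whose owner changed when node4 joined the ring.
moved = [k for k in keys if before.owner(k) != after.owner(k)]
```

Every key in `moved` lands on the new node; keys owned by the other nodes stay where they were, which is the property the slide describes.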

What is a CQL table and how is it related to a column family?

What are row, row key, column key and column value?

A row is the smallest unit that stores related data in Cassandra

• Rows – individual rows constitute a column family

• Row key – uniquely identifies a row in a column family

• Row – stores pairs of column keys and column values

• Column key – uniquely identifies a column value in a row

• Column value – stores one value or a collection of values
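One way to picture these terms is as a map of maps. The sketch below is only a mental model, not how Cassandra stores data on disk; the column family name `users` and the helper `column_value` are hypothetical, and the values come from the example rows earlier in this deck.

```python
# Column family "users" (hypothetical name): row key -> {column key: column value}
users = {
    "jim":    {"age": 36, "car": "camaro", "gender": "M"},
    "carol":  {"age": 37, "car": "bmw", "gender": "F"},
    "johnny": {"age": 12, "gender": "M"},   # no "car" column in this row
    "suzy":   {"age": 10, "gender": "F"},
}


def column_value(column_family, row_key, column_key):
    """Look up one column value; returns None when the row has no such column."""
    return column_family.get(row_key, {}).get(column_key)


jims_car = column_value(users, "jim", "car")
johnnys_car = column_value(users, "johnny", "car")
```

Rows do not need to carry the same set of columns: the row for johnny simply has no `car` entry.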

What are partition, partition key, row, column and cell?

How does Cassandra write so fast?

Cassandra uses a log-structured storage engine

• Data is sequentially appended, not placed in pre-set locations

What are the key components of the write path?

Each node implements four key components to handle its writes:

• Memtables – in-memory tables corresponding to CQL tables, with indexes

• CommitLog – append-only log, replayed to restore a downed node's Memtables

• SSTables – Memtable snapshots periodically flushed to disk, clearing heap

• Compaction – periodic process to merge and streamline SSTables

How does the write path flow on a node?

When a node receives a write request:

• The record is appended to the CommitLog, and

• The record is appended to the Memtable for the record's target CQL table

• Periodically, Memtables are flushed to SSTables, clearing JVM heap and CommitLog space

• Periodically, Compaction runs to merge and streamline SSTables
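The flow above can be sketched as a toy in-memory model. Everything here is an assumption made for illustration: the class name `Node`, the `flush_threshold` parameter, the logical clock used as a timestamp, and the use of plain Python lists and dicts in place of on-disk files. It is not Cassandra's storage engine.

```python
class Node:
    """Toy write path: CommitLog + Memtable, flushed to SSTables, then compacted."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # in-memory writes: key -> (timestamp, value)
        self.sstables = []     # each SSTable: immutable, key-sorted tuple of entries
        self.flush_threshold = flush_threshold  # hypothetical flush trigger
        self._clock = 0        # logical clock used as a write timestamp

    def write(self, key, value):
        self._clock += 1
        self.commit_log.append((key, self._clock, value))  # 1. append to CommitLog
        self.memtable[key] = (self._clock, value)          # 2. apply to Memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the Memtable out as a sorted, immutable SSTable, then free memory.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable.clear()
        self.commit_log.clear()  # flushed records no longer need replay

    def compact(self):
        # Merge all SSTables, keeping only the newest value for each key.
        merged = {}
        for table in self.sstables:
            for key, (ts, value) in table:
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        self.sstables = [tuple(sorted(merged.items()))]


node = Node()
node.write("jim", "camaro")
node.write("carol", "bmw")      # second key triggers a flush
node.write("jim", "mustang")    # newer value for an existing key
node.write("suzy", "bike")      # triggers a second flush
tables_before_compaction = len(node.sstables)
node.compact()
```

Writes never modify an existing SSTable; the older value for jim is only discarded when compaction merges the two tables.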

What are Memtables and how are they flushed to disk?

What is an SSTable and what are its characteristics?

What is compaction?

How does the read path flow on each node?

What is a data modeling framework?

A Sample Data Model

What is a conceptual data model?

Partitioning

• Nodes are logically structured in a ring topology.

• The hashed value of the key associated with a data partition is used to assign it to a node in the ring.

• The hash space wraps around after a maximum value, which gives the ring its structure.

• Lightly loaded nodes move position on the ring to relieve heavily loaded nodes.
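The ring assignment and wrap-around can be sketched with the Murmur3 tokens from the example table earlier in this deck. The node names and node tokens below are hypothetical, chosen only for illustration, and the rule "the first node whose token is greater than or equal to the key's token owns the key" is a simplified model.

```python
import bisect

# Murmur3 tokens copied from the example table earlier in this deck.
KEY_TOKENS = {
    "jim": -2245462676723223822,
    "carol": 7723358927203680754,
    "johnny": -6723372854036780875,
    "suzy": 1168604627387940318,
}

# Hypothetical node tokens in ascending order (illustrative values only).
NODE_TOKENS = [
    (-6000000000000000000, "node A"),
    (-1000000000000000000, "node B"),
    (3000000000000000000, "node C"),
    (7000000000000000000, "node D"),
]


def owner(key_token):
    """Return the first node at or after key_token, wrapping past the last node."""
    tokens = [t for t, _ in NODE_TOKENS]
    i = bisect.bisect_left(tokens, key_token)
    return NODE_TOKENS[i % len(NODE_TOKENS)][1]


placement = {key: owner(tok) for key, tok in KEY_TOKENS.items()}
```

The token for carol is larger than every node token, so the lookup wraps around to node A; that wrap is what makes the token space a ring.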


Appendix

CAP Theorem

Consistency – All the servers in the system will have the same data, so anyone using the system will get the same copy regardless of which server answers their request.

Availability – The system will always respond to a request (even if it's not the latest data, not consistent across the system, or just a message saying the system isn't working).

Partition Tolerance – The system continues to operate as a whole even if individual servers fail or can't be reached.

Cassandra Architecture Overview

○ Cassandra was designed with the understanding that system/hardware failures can and do occur

○ Peer-to-peer, distributed system

○ All nodes are the same

○ Data partitioned among all nodes in the cluster

○ Custom data replication to ensure fault tolerance

○ Read/Write-anywhere design

○ Google BigTable - data model
    ○ Column Families
    ○ Memtables
    ○ SSTables

○ Amazon Dynamo - distributed systems technologies
    ○ Consistent hashing
    ○ Partitioning
    ○ Replication
    ○ One-hop routing

Transparent Elasticity

Nodes can be added to and removed from a Cassandra cluster online, with no downtime.

[Figure: a 6-node ring expanded to a 12-node ring]

Transparent Scalability

Adding Cassandra nodes increases performance linearly and extends the cluster's capacity to manage terabytes to petabytes of data.

[Figure: a 6-node ring with performance throughput = N; a 12-node ring with performance throughput = N x 2]

High Availability

Cassandra, with its peer-to-peer architecture, has no single point of failure.

Multi-Geography/Zone Aware

Cassandra allows a single logical database to span 1 to N geographically dispersed datacenters. It also supports hybrid on-premises/cloud deployments.

Data Redundancy

Cassandra supports configurable data redundancy (replication) to protect data against loss. It also supports rack awareness: data can be replicated across different racks to guard against machine or rack failures.

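Note: the original Facebook design of Cassandra used ZooKeeper to elect a leader that told nodes which ranges they were replicas for. Apache Cassandra does not depend on ZooKeeper; nodes share cluster state through gossip.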
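Rack-aware replica placement can be sketched as a clockwise walk around the ring that prefers racks not yet holding a replica. This is a deliberately simplified illustration, not Cassandra's replication strategy code; the function `place_replicas`, the node names, and the rack names are all hypothetical.

```python
def place_replicas(ring, start, rf):
    """Pick rf replica nodes.

    ring:  list of (node, rack) pairs in ring order
    start: index of the node that owns the partition's token
    rf:    replication factor (number of copies to place)
    """
    # Ring order starting at the owning node and walking clockwise once.
    order = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    chosen, racks_used = [], set()

    # First pass: skip nodes whose rack already holds a replica.
    for node, rack in order:
        if len(chosen) < rf and rack not in racks_used:
            chosen.append(node)
            racks_used.add(rack)

    # Second pass: with fewer racks than replicas, fill from the remaining nodes.
    for node, _ in order:
        if len(chosen) < rf and node not in chosen:
            chosen.append(node)

    return chosen


ring = [("n1", "rack1"), ("n2", "rack1"), ("n3", "rack2"),
        ("n4", "rack2"), ("n5", "rack3")]
replicas = place_replicas(ring, start=0, rf=3)
```

With three racks and three replicas, each copy lands on a different rack, so losing one whole rack still leaves two copies reachable.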

Security in Cassandra

• Internal Authentication

Manages login IDs and passwords inside the database.

• Object Permission Management

Controls who has access to what and who can do what in the database.

Uses familiar GRANT/REVOKE from relational systems.

• Client to Node Encryption

Protects data in flight to and from a database
