Cassandra & Python - Springfield MO User Group

Page 1: Cassandra & Python - Springfield MO User Group

Cassandra & Python
Adam Hutson, Data Architect

@adamhutson

Page 2: Cassandra & Python - Springfield MO User Group

Who am I and What do we do?

• Adam Hutson
• Data Architect of DataScale -> www.datascale.io
• DataStax MVP for Apache Cassandra
• DataScale provides hosted data platforms as a service
• Offering Cassandra & Spark, with more to come
• Currently hosted in Amazon or Azure

Page 3: Cassandra & Python - Springfield MO User Group

Fun Fact

• DataScale was purchased by DataStax last week
• It was publicly made official today … Surprise!

Page 4: Cassandra & Python - Springfield MO User Group

Cassandra Overview

Page 5: Cassandra & Python - Springfield MO User Group

What is Big Data?

Small Data - flat file, script-based

Medium Data - single server; typical RDBMS; ACID

Big Data - multiple servers; replication = lag; sharding = management headache; expensive machines; no longer ACID; lots of people to maintain

Page 6: Cassandra & Python - Springfield MO User Group

Cassandra Overview

Distributed database management system

Peer-to-peer design (no master, no slaves)

Can run on commodity machines

No single point of failure

Has linear scalability

Is a cluster of equal machines organized as a ring of hash (token) values

Chooses Availability & Partition over Consistency; Is eventually consistent

Data is replicated automatically

Tunable consistency at the query level

Can span multiple data centers and multiple physical locations

Page 7: Cassandra & Python - Springfield MO User Group

Cassandra Origins

Based on Amazon’s Dynamo and Google’s BigTable

Created at Facebook for the Inbox search system in 2007

Facebook open-sourced on Google code in 2008

Became an Apache Incubator project in 2009

By 2010, it graduated to a top-level project at Apache

Apache Cassandra can be run completely royalty-free

DataStax offers a licensed/corporate version with additional tools/integrations

Page 8: Cassandra & Python - Springfield MO User Group

Why Cassandra over traditional RDBMS?

RDBMS Pros:

Single Machine

ACID guarantees

Scales vertical (bigger machine)

RDBMS Cons:

Growing past Single Machine

Scale horizontal = Replication lag

Sharding = Complicated codebase

Failover = on-call headache

Page 9: Cassandra & Python - Springfield MO User Group

Why Cassandra over traditional RDBMS?

Cassandra Pros:

Scale horizontal

Shard-less

Failure indifferent

CAP: AP with tunable C

Cassandra Cons:

Eventually Consistent

Page 10: Cassandra & Python - Springfield MO User Group

ACID

Atomicity = All or nothing transactions
Consistency = Guarantees committed transaction state
Isolation = Transactions are independent
Durability = Committed data is never lost

A nice warm, fuzzy feeling that transactions are perfect.

Cassandra follows the A, I, & D in ACID, but not C; is Eventually Consistent

In NoSQL, we make trade-offs to serve the greatest need

Page 11: Cassandra & Python - Springfield MO User Group

CAP Theorem

Consistency = all nodes see the same data at the same time
Availability = every request gets a response about whether it succeeded or failed
Partition Tolerance = the system continues to operate despite partition failures

You have to choose 2: CA, CP, or AP

Cassandra chooses AP; To be highly available in a network partition

Page 12: Cassandra & Python - Springfield MO User Group
Page 13: Cassandra & Python - Springfield MO User Group

Architecture Terms

Nodes: where data is stored; typically a logical machine

Data Center: collection of nodes; assigned replication; logical workload grouping; should not span physical location

Cluster: one or more data center(s); can span physical location

Commit Log: first stop for any write; provides durability

Table: collection of ordered columns fetched by row;

SSTable: sorted string table; immutable file; append only; sequentially stored

Page 14: Cassandra & Python - Springfield MO User Group

Architecture Components

Gossip: peer-to-peer communication protocol; discovers/shares data & locations

Partitioner: determines how to distribute data to replicas

Replication Factor: determines how many replicas to maintain

Replica Placement Strategy: determines which node(s) will contain replicas

Snitch: defines groups of nodes into data centers & racks that replication uses

Page 15: Cassandra & Python - Springfield MO User Group

What is a Node?

Just a small part of a big system

Represents a single machine (Server/VM/Container)

Has a JVM running the Cassandra Java process

Can run anywhere (RaspberryPi/laptop/cloud/on-premise)

Responsible for writing & reading its data

Typically 3,000-5,000 tps per core & 1-3 TB of data

Page 16: Cassandra & Python - Springfield MO User Group

Cluster

A cluster is a bunch of nodes that together are responsible for all the data.

Each node is responsible for a different range of the data, aka the token ranges.

A cluster can hold token values from -2^63 to 2^63-1.

The ring starts with the smallest number and circles clockwise to the largest.

Page 17: Cassandra & Python - Springfield MO User Group

We are using 0-99 to represent tokens.

Because -2^63 to 2^63-1 is too hard to visualize

Node 1 is responsible for tokens 76-0

Node 2 is responsible for tokens 1-25

Node 3 is responsible for tokens 26-50

Node 4 is responsible for tokens 51-75
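To make the ring concrete, here is a small Python sketch (illustration only, using the 0-99 toy tokens above rather than real Murmur3 tokens) that looks up which node owns a given token:

from bisect import bisect_left

# Each node is listed with the highest token it owns on the toy 0-99 ring.
ring = [(0, 'Node 1'), (25, 'Node 2'), (50, 'Node 3'), (75, 'Node 4')]

def owner(token):
    # Walk clockwise to the first node whose upper bound covers the token.
    tokens = [t for t, _ in ring]
    idx = bisect_left(tokens, token)
    return ring[idx % len(ring)][1]   # past 75 wraps around to Node 1

print(owner(30))   # Node 3 (owns 26-50)
print(owner(80))   # Node 1 (76-0 wraps around the ring)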

Page 18: Cassandra & Python - Springfield MO User Group

Replication

Replication is when we store replicas of data on multiple nodes to ensure reliability and fault tolerance. The total number of replicas is the replication factor.

The cluster just shown had the most basic replication factor of 1 (RF=1).

Each node was responsible only for its own data.

What happens if a node is lost/corrupt/offline? → We need replicas.

Change the replication factor to 2 (RF=2)

Each node will be responsible for its own data and its neighbor's data.

Page 19: Cassandra & Python - Springfield MO User Group

We are using 0-99 to represent tokens.

Because -2^63 to 2^63-1 is too hard to visualize

Replication Factor = 2

Node 1 is responsible for tokens 76-0 & 1-25

Node 2 is responsible for tokens 1-25 & 26-50

Node 3 is responsible for tokens 26-50 & 51-75

Node 4 is responsible for tokens 51-75 & 76-0
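A hedged sketch of the RF=2 layout in this diagram: each range lives on its primary node plus one neighbor, so no single node is the only holder of any data (this is a toy model of the picture above, not Cassandra's actual replica placement code):

nodes = ['Node 1', 'Node 2', 'Node 3', 'Node 4']   # clockwise order; Node 1 owns 76-0

def replicas(primary, rf=2):
    # The primary node plus rf-1 neighbors hold copies of the range, per the diagram.
    i = nodes.index(primary)
    return [nodes[(i - k) % len(nodes)] for k in range(rf)]

print(replicas('Node 1'))   # ['Node 1', 'Node 4'] -> Node 4 also holds tokens 76-0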

Page 20: Cassandra & Python - Springfield MO User Group

Replication with Node Failure

Sooner or later, we’re going to lose a node. With replication, we’re covered.

Node 1 goes offline for maintenance, or someone turned off that rack.

What if I need the row with token value 67, but Node 1 was responsible for that range?

Node 4 saves the day because it has a replica of Node 1’s range.

Page 21: Cassandra & Python - Springfield MO User Group

Consistency

We want our data to be consistent.

We have to be aware that being Available and having Partition Tolerance is more important than being strictly consistent.

Consistency is tunable though. We can choose to have certain queries have stronger consistency than others.

The client can specify a Consistency Level for each read or write query issued.

Page 22: Cassandra & Python - Springfield MO User Group

Consistency Level

ONE: only one replica has to acknowledge the read or write

QUORUM: 51% of the replicas have to acknowledge the read or write

ALL: all of the replicas have to acknowledge the read or write

With Multiple Data Centers:

LOCAL_ONE: only one replica in the local data center

LOCAL_QUORUM: 51% of the replicas in the local data center
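For reference, QUORUM works out to a majority of the replicas; a one-line sketch of the usual formula, floor(RF/2) + 1:

def quorum(replication_factor):
    return replication_factor // 2 + 1

print(quorum(3))   # 2 replicas must acknowledge
print(quorum(5))   # 3 replicas must acknowledge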

Page 23: Cassandra & Python - Springfield MO User Group

Consistency & Replication working together

Scenario: 4 Nodes with Replication Factor of 3

Desire High Write Volume/Speed, Low Read frequency

Write of Consistency Level of ONE, will write to 1 node, replication will sync to other 2

Read of Consistency Level of QUORUM, will read from 2

Desire High Read Volume/Speed, Low Write frequency

Write of Consistency Level of QUORUM, will write to 2 nodes, replication will sync to other 1.

Read of Consistency Level of ONE, will read from 1 node
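A short sketch of the rule of thumb behind these trade-offs (not stated on the slide): a read is guaranteed to see the latest write only when the replicas written plus the replicas read exceed the replication factor.

def read_sees_latest_write(write_replicas, read_replicas, rf):
    # True when the write set and read set must overlap in at least one replica.
    return write_replicas + read_replicas > rf

print(read_sees_latest_write(2, 2, 3))   # True:  QUORUM write + QUORUM read overlap
print(read_sees_latest_write(1, 1, 3))   # False: ONE write + ONE read may lag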

Page 24: Cassandra & Python - Springfield MO User Group

Peer-to-peer

Cassandra is not Client/Server or Master/Slave or Read-Write/Read-Only.

No routing data to shards; no leader elections; no split brains.

In Cassandra, every node is a peer to every other node.

Every instance of Cassandra is running the same Java process.

Each node is independent and completely replaceable.

Page 25: Cassandra & Python - Springfield MO User Group

Gossip

One node tells its neighbor nodes

Its neighbor nodes tell their neighbor nodes

Their neighbor nodes tell more neighbor nodes

Soon, every node knows about every other node’s business.

“Node X is down”

“Node Y is joining”

“There’s a new data center”

“The east coast data center is gone”

Page 26: Cassandra & Python - Springfield MO User Group

Hinted Handoff

Failed writes will happen. When a write happens, the nodes are trying to get the new data to all the replicas.

If one of the replica nodes is offline, then the other replica nodes are going to remember what data the down node was supposed to receive, aka keep hints.

When that node appears online again, then the other nodes that kept hints are going to handoff those hints to the newly online node.

Hints are kept for a tunable period, defaults to 3 hours.

Page 27: Cassandra & Python - Springfield MO User Group

Write Path

Nothing to do with which node the data will be written to, and everything to do with what happens internally in the node to get the data persisted.

When the data enters the Cassandra Java process, the data is written to 2 places:

1. First appended to the Commit Log file.

2. Second to the MemTable.

The Commit Log is immediately durable and is persisted to disk. The MemTable is a representation of the data in RAM.

Afterwards, the write is acknowledged back to the client.

Page 28: Cassandra & Python - Springfield MO User Group

Write Path (cont’d)

Once the MemTable fills up, then it flushes its data to disk as an SSTable.

If the node crashes before the data is flushed, then the Commit Log is replayed to re-populate the MemTable.

Page 29: Cassandra & Python - Springfield MO User Group

Read Path

The idea is that it looks in the MemTable first, then in the SSTable.

The MemTable has the most recent partitions of data.

The SSTable files are sequential, but can get really, really big.

The Partition Index file keeps track of partitions and the offsets of their locations in the SSTable, but it too can get large.

The Summary Index file keeps track of offsets in the Partition Index.

Page 30: Cassandra & Python - Springfield MO User Group

Read Path (cont’d)

We can go faster by using the Key Cache, an in-memory index of partitions and their offsets in the SSTable. It skips the Partition & Summary Index files, but only works on previously requested keys.

But which SSTable/Partition/Summary file should be looked at? Bloom Filters, a probabilistic data structure, keep track by saying that the key you’re looking for is “definitely not there” or “maybe it’s there”. The false-positive “maybes” are rare, and the rate is tunable.
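A toy Bloom filter makes the “definitely not there” vs “maybe there” behaviour concrete (illustration only, not Cassandra's implementation):

import hashlib

class BloomFilter(object):
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(('%d:%s' % (i, key)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False = definitely not there; True = maybe there (small chance of false positive)
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add('partition-123')
print(bf.might_contain('partition-123'))   # True  (maybe there)
print(bf.might_contain('partition-999'))   # False (definitely not there)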

Page 31: Cassandra & Python - Springfield MO User Group

Deleting Data

Data is never deleted in Cassandra. When a column value in a row, or an entire row, is requested to be deleted, Cassandra actually writes additional data that marks the column or row with a timestamp that says it’s deleted. This is called a tombstone.

Whenever a read occurs, it will skip over the tombstoned data and not return it.

Skipping over tombstones still incurs I/O though. There’s even a metric that will tell you the avg. tombstones being read. So we’ll need to remove them from the SSTables at some point.

Page 32: Cassandra & Python - Springfield MO User Group

Compaction

Compaction is the act of merging SSTables together. But why → Housekeeping. SSTables are immutable, so we can never update the file to change a column’s value.

If your client writes the same data 3 times, there will be 3 entries in potentially 3 different SSTables. (This assumes the MemTable flushed the data in-between writes).

Reads have to read all 3 SSTables and compare the write timestamps to get the correct value.
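A tiny sketch of that reconciliation, with made-up keys and timestamps: the same key appears in several SSTables, and the entry with the newest write timestamp wins.

# (value, write timestamp) per key; the data below is purely illustrative
sstable_1 = {'user:42': ('adam', 100)}
sstable_2 = {'user:42': ('adam h.', 250)}
sstable_3 = {'user:42': ('adam hutson', 975)}

def merge(*sstables):
    merged = {}
    for sstable in sstables:
        for key, (value, ts) in sstable.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)   # last write wins
    return merged

print(merge(sstable_1, sstable_2, sstable_3))   # {'user:42': ('adam hutson', 975)}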

Page 33: Cassandra & Python - Springfield MO User Group

Compaction

What happens is that Cassandra will compact the SSTables by reading in those 3 SSTables and writing out a new SSTable with only the single entry. The 3 older SSTables will get deleted at that point.

Compaction is when tombstones are purged too. Tombstones are kept around long enough so that we don’t get “phantom deletes” though (tunable period).

Compaction will keep your future seeks on disk low.

There’s a whole algorithm on when Compaction runs, but it’s automatically setup by default.

Page 34: Cassandra & Python - Springfield MO User Group

Repairs

Repairs are needed because, over time, distributed data naturally gets out of sync across all the locations. Repairs just make sure that all your nodes are consistent.

Repairs happen at two times.

1. After each read, there is a (tunable) chance that a repair will occur. When a client requests to read a particular key, a background process will gather all the data from all the replicas, and update all the replicas to be consistent.

2. At scheduled times that are manually controlled by an admin.

Page 35: Cassandra & Python - Springfield MO User Group

Failure Recovery

Sometimes nodes go down due to maintenance, or a real catastrophe.

Cassandra will keep track of down nodes with gossip. Hints are automatically held for a (tunable) period. So when/if the node comes back online, the other nodes will tell it what it missed.

If the node doesn’t come back online, you have to create a new node to replace it. Assign it the same tokens as the lost node, and the other nodes will stream the necessary data to it from replicas.

Page 36: Cassandra & Python - Springfield MO User Group

Scaling

Scaling is when you add more capacity to the cluster. Typically, this is when you add more nodes.

You create one or more new nodes and add them to the cluster.

A new node will join the ring and take responsibility for a part of the token ranges.

While it’s joining, other nodes will stream data for the token ranges it will own.

Once it’s fully joined, the node will start participating in normal operations.

Page 37: Cassandra & Python - Springfield MO User Group

Python with Cassandra

Page 38: Cassandra & Python - Springfield MO User Group

Python - Getting Started

Install the Python driver via pip:

pip install cassandra-driver

In a .py file, create a cluster & session object and connect to it:

from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect()

The cluster object represents a Cassandra cluster on your localhost. The session object will manage all connection pooling. Only create a session object once in your codebase and reuse it throughout; otherwise, the repeated initial connection establishment will eventually become a bottleneck.
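One way to follow that advice is to build the session once in its own module and import it everywhere (the module name here is just an example):

# cassandra_session.py
from cassandra.cluster import Cluster

_cluster = Cluster()
session = _cluster.connect()

# elsewhere:
# from cassandra_session import session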

Page 39: Cassandra & Python - Springfield MO User Group

Specifying a Cluster

Localhost is great for sandboxing, but soon a real cluster with real IPs will be needed.

cluster = Cluster(['54.211.95.95', '52.90.150.156', '52.87.198.119'])

This is the set of IPs from a demo cluster I’ve created.

Page 40: Cassandra & Python - Springfield MO User Group

Authorization

Every cluster should have at least PlainTextAuthProvider enabled.

import cassandra.auth

my_auth_provider = cassandra.auth.PlainTextAuthProvider(
    username='adamhutson',
    password='P@$$w0rD')

There are other methods of authorization, but this is the most common. Add the above snippet before you create your Cluster object. Then pass the my_auth_provider object to the Cluster’s auth_provider option key.

cluster = Cluster(auth_provider=my_auth_provider)
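Putting the pieces together, a sketch using the demo IPs from earlier and placeholder credentials:

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

auth = PlainTextAuthProvider(username='adamhutson', password='P@$$w0rD')
cluster = Cluster(['54.211.95.95', '52.90.150.156', '52.87.198.119'],
                  auth_provider=auth)
session = cluster.connect()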

Page 41: Cassandra & Python - Springfield MO User Group

Keyspace Selection

There are 3 ways to specify a keyspace to use:

1. session = cluster.connect('my_keyspace')

2. session.set_keyspace('my_keyspace')

3. session.execute('select * from my_keyspace.my_table')

It doesn’t matter which way you choose to go, just be consistent with your selection.

Personally, I use choice #2, as I can run it before I interact with the database. It keeps me from pinning myself to a single keyspace at session creation time. It also keeps me from having to type out the keyspace name every time I write a DML statement.

Page 42: Cassandra & Python - Springfield MO User Group

Simple Statement

The first thing most will want to do is select some data out of Cassandra. Let’s use the system keyspace, and retrieve some of the metadata about the cluster.

session.set_keyspace('system')

rows = session.execute("SELECT source_id, date, event_time, event_value FROM time_series WHERE source_id = '123' and date = '2016-09-01'")

for row in rows:
    print row.source_id, row.date, row.event_time, row.event_value

Page 43: Cassandra & Python - Springfield MO User Group

Prepared/Bound Statement

Every time we run that Simple Statement from above, Cassandra has to compile the query. What if you’re going to run the same select repeatedly? That compile time will become a bottleneck.

session.set_keyspace('training')

prepared_stmt = session.prepare('SELECT source_id, date, event_time, event_value FROM time_series WHERE source_id = ? and date = ?')

bound_stmt = prepared_stmt.bind(['123','2016-09-01'])

rows = session.execute(bound_stmt)

for row in rows:
    print row.source_id, row.date, row.event_time, row.event_value
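As a shortcut, the driver also accepts the parameters directly in execute(), binding the prepared statement for you:

rows = session.execute(prepared_stmt, ['123', '2016-09-01'])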

Page 44: Cassandra & Python - Springfield MO User Group

Batch Statement

A batch groups operations so they are applied atomically. You can also specify a BatchType.

from cassandra import ConsistencyLevel
from cassandra.query import BatchStatement

insert_user = session.prepare("INSERT INTO users (name, age) VALUES (?, ?)")

batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)

for (name, age) in users_to_insert:
    batch.add(insert_user, (name, age))

session.execute(batch)

Be careful. This can potentially be inserting to token ranges all over the cluster. Best practice is to use batches for inserting multiple rows into the same partition.

Page 45: Cassandra & Python - Springfield MO User Group

Consistency Level

Consistency Level can be specified at the query level. Just need to import the necessary library and set it. This setting will remain with the session until you destroy the object or set it to a different CL.

from cassandra import ConsistencyLevel

session.default_consistency_level = ConsistencyLevel.ONE

There are a bunch of session level options that you can specify.

Most of the same options are also available at the Statement level.
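For example, consistency can be attached to a single statement rather than the whole session (the query text here is illustrative):

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

stmt = SimpleStatement(
    'SELECT source_id, date, event_time, event_value FROM time_series WHERE source_id = %s',
    consistency_level=ConsistencyLevel.QUORUM)
rows = session.execute(stmt, ['123'])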

Page 46: Cassandra & Python - Springfield MO User Group

Shutdown

This is so simple, but so important. Finish every script with the following:

cluster.shutdown()

If you don’t do this at the end of your python file, you will leak connections on the server side.

I’ve done it, and it was completely embarrassing. Learn from my mistakes. Don’t forget it.
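One way to make sure the shutdown always happens, even if the script raises, is a try/finally:

from cassandra.cluster import Cluster

cluster = Cluster()
try:
    session = cluster.connect()
    # ... do work ...
finally:
    cluster.shutdown()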

Page 47: Cassandra & Python - Springfield MO User Group
Page 48: Cassandra & Python - Springfield MO User Group

Thank You!

Questions?

Adam Hutson / [email protected] / @datastax.com

@AdamHutson / @DataScaleInc / @DataStax