
Spark Cassandra Connector: Past, Present and Future


Russell Spitzer @RussSpitzer

Software Engineer - DataStax

The Past: Hadoop and C*


Hadoop integration with C* required a bit of knowledge and was generally not very easy.

MapReduce Code

public static class ReducerToCassandra
        extends Reducer<Text, IntWritable, Map<String, ByteBuffer>, List<ByteBuffer>>
{
    private Map<String, ByteBuffer> keys;
    private ByteBuffer key;

    protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
            throws IOException, InterruptedException
    {
        keys = new LinkedHashMap<String, ByteBuffer>();
    }

    public void reduce(Text word, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable val : values)
            sum += val.get();
        keys.put("word", ByteBufferUtil.bytes(word.toString()));
        context.write(keys, getBindVariables(word, sum));
    }

    private List<ByteBuffer> getBindVariables(Text word, int sum)
    {
        List<ByteBuffer> variables = new ArrayList<ByteBuffer>();
        variables.add(ByteBufferUtil.bytes(String.valueOf(sum)));
        return variables;
    }
}

Hadoop Interfaces are … difficult


https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java

Even simple integration with a Hadoop cluster took a lot of experience to get right.


Well, at least you have Pig built in, right?

   moredata = load 'cql://cql3ks/compmore' USING CqlNativeStorage;
   insertformat = FOREACH moredata GENERATE
       TOTUPLE(TOTUPLE('a',x), TOTUPLE('b',y), TOTUPLE('c',z)), TOTUPLE(data);
   STORE insertformat INTO
       'cql://cql3ks/compotable?output_query=UPDATE%20cql3ks.compotable%20SET%20d%20%3D%20%3F'
       USING CqlNativeStorage;


Spark Offers a New Path


• Core libraries for ML/Streaming
• No need for HDFS/Hadoop
• Easy integration with other data sources

val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

RDD API

df.groupBy("age").count().show()

Dataframes API

head(filter(df, df$waiting < 50))

R API

SELECT name FROM people

SQL API


Enter The Spark Cassandra Connector


First Public Release at the Spark Summit in June 2014

"If you write a Spark application that needs access to Cassandra, this library is for you." - Piotr Kołaczkowski

https://github.com/datastax/spark-cassandra-connector

Open Source Software

1394 Commits 28 Contributors

Why do we even want a Distributed Analytics tool?


• Generating reports
• Direct analytics on our data
• Cassandra maintenance
   • Making new views
   • Changing partition keys
• Streaming
• Machine learning
• ETL data between different sources

We have small questions and big questions and they need to work in different ways

How many shoes did Marty buy? (Marty's purchase history is enough: small data.)

How many shoes were sold last year compared to this year, grouped by demographic? (That needs all the shoe data: BIG DATA, and we answer it a part at a time.)

When we actually want to work with large amounts of data we break it into parts


Distributed FS/databases already do this for us

(Figure: Node1-Node4, each holding part of the shoe data.)

Spark describes large, multi-machine datasets using the RDD (Resilient Distributed Dataset).

(Figure: an RDD's Spark partitions mapped over Node1-Node4, each node holding part of the shoe data.)

In Cassandra this distribution is mapped out by token ranges

(Figure: token ranges 1-10000, 10001-20000, 20001-30000, and 30001-40000 assigned to Node1-Node4, each holding part of the shoe data.)

This distribution is key to how Cassandra handles OLTP Requests


SELECT amount FROM orders WHERE customer = martyID

How many shoes did Marty buy?

martyId -> Token -> 3470: look up the data for Marty on the node whose range covers that token.
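You can see the same mapping from CQL itself; a hypothetical query using Cassandra's token function against the orders table above:

   SELECT token(customer), amount FROM orders WHERE customer = martyID;

The partitioner hashes the partition key to a token, and the token decides which node owns the row.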

The Connector Maps Cassandra Tokens to Spark Partitions


sc.cassandraTable("keyspace","tablename")

(Figure: each node's token range is split into Spark partitions - 00001-02500, 02501-05000, ..., 37501-40000 - and the CassandraRDD is made of those splits.)

This allows for Node Local operations!


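A quick way to see this from the Spark shell (a sketch: assumes the connector is on the classpath and spark.cassandra.connection.host is set; the keyspace and table names are placeholders):

   import com.datastax.spark.connector._

   val rdd = sc.cassandraTable("keyspace", "tablename")

   // Each Spark partition covers a group of Cassandra token ranges
   println(rdd.partitions.length)

   // Preferred locations name the replica nodes for each partition, so
   // Spark can schedule each task next to the data it will read
   rdd.partitions.take(4).foreach(p => println(rdd.preferredLocations(p)))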

Under the Hood the Spark Cassandra Connector Uses the Java Driver to pull Information from C*


Check out my videos on DataStax Academy for a deep dive!

Check out Robert's Talk!

5:10 PM - 5:50 PM B1 - B3

https://academy.datastax.com/tutorials
https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data
https://academy.datastax.com/demos/how-spark-cassandra-connector-writes-data
https://academy.datastax.com/demos/how-spark-works-dsestandalone-mode

The Present: Capabilities and Features


Official releases for Spark 1.0 - 1.4; milestone release for 1.5
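To build against a release, add the connector artifact to your project (sbt shown; the version here is illustrative, pick the one matching your Spark version):

   libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.4.0-M1"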

Read Cassandra Data into RDDs / Write RDDs into Cassandra


RDD[Letter]

case class Letter(mailbox: Int, body: String, fromuser: String, touser: String)

CREATE TABLE important.letters (
  mailbox int,
  touser text,
  fromuser text,
  body text,
  PRIMARY KEY ((mailbox), touser, fromuser));

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md


sc.cassandraTable[Letter]("important","letters")


rdd.saveToCassandra("important","letters")

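Putting the two together, a minimal round trip looks something like this (a sketch; assumes a SparkContext configured with spark.cassandra.connection.host):

   import com.datastax.spark.connector._

   case class Letter(mailbox: Int, body: String, fromuser: String, touser: String)

   // Columns are matched to the case class fields by name on read
   val letters = sc.cassandraTable[Letter]("important", "letters")

   // And fields are matched back to columns on write (writes are upserts)
   letters
     .filter(_.touser == "doc")
     .saveToCassandra("important", "letters")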

Ability to push down relevant filters to the C* Server



https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md


(Figure: a partition for Mailbox 1 and a partition for Mailbox 2, each ordered by touser.)


mailbox: 2, touser: marty, fromuser: doc, body: It's your kids, Marty. Something gotta be done about your kids!

mailbox: 1, touser: doc, fromuser: marty, body: What happens to us in the future?

mailbox: 1, touser: lorraine, fromuser: marty, body: Calvin? Wh… Why do you keep calling me Calvin



sc.cassandraTable("important", "letters")
  .select("body")
  .where("touser = ?", "einstein")
  .collect


Select lets us request only certain columns from C*.


Where lets us add any CQL predicate that Cassandra allows.


Only the data we specifically request is pulled from C*.


Java API Support


All functionality introduced in the Scala API is also available in the Java API.

Reading:

   // mapColumnTo converts the single selected column to the requested type
   JavaRDD<String> bodiesRDD = javaFunctions(sc)
       .cassandraTable("important", "letters", mapColumnTo(String.class))
       .select("body");

Writing:

   javaFunctions(rdd).writerBuilder(
       "important",
       "letters",
       mapToRow(Letter.class)
   ).saveToCassandra();

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md


But what if you want to work with brand new Dataframes?

Full Dataframes Support: org.apache.spark.sql.cassandra


Dataframes (aka SchemaRDDs) provide a new and more generic API than raw RDDs.

Reading:

   val df = sqlContext
     .read
     .format("org.apache.spark.sql.cassandra")
     .options(Map(
       "keyspace" -> "important",
       "table" -> "letters"))
     .load()

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md


The same source is available to pure SQL:

   CREATE TABLE letters
     USING org.apache.spark.sql.cassandra
     OPTIONS (
       keyspace "important",
       table "letters"
     )


Writing:

   df.write
     .format("org.apache.spark.sql.cassandra")
     .options(Map(
       "keyspace" -> "important",
       "table" -> "letters"))
     .save()


Or in SQL:

   CREATE TABLE letters_copy
     USING org.apache.spark.sql.cassandra
     OPTIONS (
       keyspace "important",
       table "letters_copy"
     )

   INSERT INTO TABLE letters_copy SELECT * FROM letters;

https://github.com/datastax/spark-­‐cassandra-­‐connector/blob/master/doc/14_data_frames.md

val  df  =  sqlContext      .read      .format("org.apache.spark.sql.cassandra")      .options(  

Map(      "keyspace"  -­‐>  "important",    "table"  -­‐>  "letters"  ))  

   .load()

CREATE  TABLE  letters            USING  org.apache.spark.sql.cassandra            OPTIONS  (                      keyspace  "important",                      table  "letters"              )

Reading

Writing

df.write      .format("org.apache.spark.sql.cassandra")      .options(          Map(            "keyspace"  -­‐>  "important",            "table"  -­‐>  "letters"                  ))      .save()

CREATE  TABLE  letters_copy            USING  org.apache.spark.sql.cassandra            OPTIONS  (              keyspace  "important",              table  "letters_copy"              )  

INSERT  INTO  TABLE  letters_copy  SELECT  *  FROM  letters;

Backed by CassandraRDD, so we can prune columns and push down predicates!

Integrated Pushdown of Predicates to C* in Dataframes


There is no need for special functions when using Dataframes, since the pushdown is done by the Catalyst optimizer.


scala> df.filter("touser > 'einstein'").explain
== Physical Plan ==
Filter (touser#1 > einstein)
 PhysicalRDD [mailbox#0,touser#1,fromuser#2,body#3], MapPartitionsRDD[6] at explain at <console>:59

Predicates are automatically checked against the C* pushdown rules; valid predicates are applied as if you had called .where on a CassandraRDD.
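For contrast, a predicate Cassandra cannot serve stays in the Spark plan. A sketch using the non-key body column:

   // body is neither a partition key, a clustering column, nor indexed,
   // so the filter is not pushed down; Spark filters after the scan
   df.filter("body = 'hello'").explain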

Pyspark and Dataframes Also Supported


Dataframes in PySpark run native code; there is no need for Python <-> Java serialization.

sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="kv", keyspace="test")\
    .load().show()

You can tell it's Python because of my need to escape line ends. Pure Python in PySpark: PySpark Dataframes!

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md


SparkR Also Works with Cassandra Dataframes!

Repartition by Cassandra Replica


Repartition any RDD to get data locality to C*!

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md

(Figure: Spark partitions located on different nodes than their respective C* data; the nodes are named 1955, 1985, and 2015.)


mailboxesToCheck
  .repartitionByCassandraReplica("important", "letters", 10)

JoinWithCassandraTable pulls specific Partition Keys From Cassandra


mailboxesToCheck
  .repartitionByCassandraReplica("important", "letters", 10)
  .joinWithCassandraTable("important", "letters")

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
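A fuller sketch of the pattern (the MailboxKey case class and key values are illustrative; any RDD whose element type covers the table's partition key works):

   import com.datastax.spark.connector._

   // Key type matching the table's partition key column
   case class MailboxKey(mailbox: Int)

   // Several thousand mailboxes we want to look up
   val mailboxesToCheck = sc.parallelize((1 to 5000).map(i => MailboxKey(i)))

   val letters = mailboxesToCheck
     .repartitionByCassandraReplica("important", "letters", 10) // move keys next to their replicas
     .joinWithCassandraTable("important", "letters")            // fetch matching rows in parallel

   letters.take(5).foreach(println)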

(Figure: several thousand mailbox keys scattered across Node1-Node4.)


Repartition places our keys local to the data they will retrieve. (Figure: keys regrouped onto Node1-Node4 next to their replicas.)



The Join then retrieves the rows in parallel


Manual Driver Sessions are available!

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md

import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE KEYSPACE test2 WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
  session.execute("CREATE TABLE test2.words (word text PRIMARY KEY, count int)")
}

Any Connections Made through CassandraConnector will use a Connection pool (even remotely!)


CassandraConnector(conf).withSessionDo { } gains a handle on a running Cluster object made with configuration conf.

(Figure: an executor JVM whose threads share a Cassandra connection pool.)

Multiple threads/executor cores will end up using the same connection.

Cassandra Connector can be used in Closures and Prepared Statements will be Cached as well


rdd.mapPartitions { it =>
  CassandraConnector(conf).withSessionDo { session =>
    val ps = session.prepare(query)  // served from the cache if already prepared
    it.map(x => session.execute(ps.bind(x))).toVector.iterator
  }
}

A reference to an already-created prepared statement will be used if available.

(Figure: executor JVM threads sharing the connection pool and a prepared statement cache.)

What is the Future of the Spark Cassandra Connector?


You!


The more people that contribute to the project, the better it will become! We welcome any contributions, or just send us a letter on the mailing list!

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#can-i-contribute-to-the-spark-cassandra-connector

Spark Packages!


http://spark-packages.org/package/datastax/spark-cassandra-connector
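Being on Spark Packages means pulling the connector into a shell session is a single flag (the version coordinate below is illustrative; check the package page for current releases):

   $ spark-shell --packages datastax:spark-cassandra-connector:1.4.0-s_2.10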

Update Even Faster to New Spark Versions


We'll be testing against Spark release candidates in the future so that we can have a compatible Spark Cassandra Connector out the moment an official Spark release is ready!

Even better Dataframes


Automatic integration of repartitionByCassandraReplica and joinWithCassandraTable

Make it so that any joins against Cassandra tables are automatically detected and, if possible, converted to JoinWithCassandraTable calls, with no need to manually determine when you should or shouldn't use the method.

Create Cassandra Tables from Dataframes Automatically

Currently all tables need to have been created in C* prior to saving; we'd like it if users could specify what kind of key they would like on their C* table and have it generated automatically on data frame writes.

Improve Spark-Cassandra-Stress

https://github.com/datastax/spark-cassandra-stress

Open source tool which lets you test maximum throughput of your cluster with Spark and C*

Includes:
• Write tests
• Read tests
• Streaming tests

Thank you
