Cassandra and Spark: Optimizing for Data Locality
Russell Spitzer, Software Engineer @ DataStax
Lex Luthor Was Right: Location Is Important
The value of many things is based upon its location. Developed land near the beach is valuable, while desert and farmland is generally much cheaper. Unfortunately, moving land is generally impossible.
Spark Summit
Lex Luthor Was Wrong: Don't ETL the Data Ocean
(or lake, or swamp, or whatever body of water "Data" is at the time this slide is viewed)
Spark Is Our Hero, Giving Us the Ability to Do Our Analytics Without the ETL
Moving Data Between Machines Is Expensive Do Work Where the Data Lives!
Our Cassandra Nodes are like Cities and our Spark Executors are like Super Heroes. We'd rather they spend their time locally than fly back and forth all the time.
[Diagram: the Spark Executors as hometown heroes: Superman in Metropolis, Batman in Gotham]
DataStax Open Source Spark Cassandra Connector is Available on Github
https://github.com/datastax/spark-cassandra-connector
• Compatible with Spark 1.3
• Read and Map C* Data Types
• Saves to Cassandra
• Intelligent write batching
• Supports Collections
• Secondary index pushdown
• Arbitrary CQL Execution!
How the Spark Cassandra Connector Reads Data Node Local
Cassandra Locates a Row Based on Partition Key and Token Range
All of the rows in a Cassandra Cluster are stored based on their location in the Token Range.
Each of the Nodes in a Cassandra Cluster is primarily responsible for one set of Tokens.
[Diagram: a token ring running 0 to 999, with Metropolis, Gotham, and Coast City placed around it]
[Diagram: the ring with ranges assigned: Metropolis 750 - 99, Gotham 350 - 749, Coast City 100 - 349]
The CQL Schema designates at least one column to be the Partition Key.
[Example row: Jacek | 514 | Red]
The hash of the Partition Key tells us where a row should be stored.
[The example row hashes to token 830, which falls in Metropolis's range]
With VNodes the ranges are not contiguous, but the same mechanism controls row location.
[Diagram: the row still lands at token 830, now inside one of Metropolis's non-contiguous ranges]
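The row-location mechanism above can be sketched in a few lines of Python. This is a toy model using the slides' 0 to 999 ring and three example ranges; real Cassandra hashes the partition key with Murmur3 over a far larger token space, and the connector itself is written in Scala.

```python
# Toy token ring matching the slides: each node owns one (start, end)
# range, inclusive on both ends; Metropolis's range wraps past 0.
RANGES = [
    ("Metropolis", 750, 99),   # 750..999 and 0..99
    ("Gotham", 350, 749),
    ("Coast City", 100, 349),
]

def owner(token):
    """Return the node primarily responsible for this token."""
    for node, start, end in RANGES:
        if start <= end:
            if start <= token <= end:
                return node
        elif token >= start or token <= end:  # wrapping range
            return node
    raise ValueError(f"token {token} outside ring")

# The example row (Jacek, 514, Red) hashes to token 830:
print(owner(830))   # Metropolis
```

With VNodes, the only change to this picture is that each node appears in `RANGES` many times with small, scattered ranges; the lookup is the same.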
Loading Huge Amounts of Data
Table Scans involve loading most of the data in Cassandra
Cassandra RDDs Use the Token Range to Create Node-Local Spark Partitions
sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra)
Token Ranges
spark.cassandra.input.split.size The (estimated) number of C* Partitions to be placed in a Spark Partition
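The grouping step can be sketched as follows. This is a simplification with hypothetical numbers and names: the real connector estimates Cassandra partition counts from token-range statistics and does this in Scala.

```python
# Toy sketch of split grouping: accumulate token ranges until the estimated
# number of C* partitions reaches split_size (spark.cassandra.input.split.size).
# Each resulting group becomes one node-local Spark partition.
def group_ranges(ranges, split_size):
    """ranges: list of (token_range, estimated_partition_count) pairs."""
    groups, current, count = [], [], 0
    for token_range, estimate in ranges:
        current.append(token_range)
        count += estimate
        if count >= split_size:
            groups.append(current)
            current, count = [], 0
    if current:
        groups.append(current)
    return groups

# Hypothetical ranges with estimated C* partition counts:
ranges = [((0, 99), 40), ((99, 199), 70), ((199, 299), 30), ((299, 399), 60)]
print(group_ranges(ranges, split_size=100))
# Two Spark partitions: [[(0, 99), (99, 199)], [(199, 299), (299, 399)]]
```

A smaller split size therefore means more, smaller Spark partitions over the same data.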
[Diagram: token ranges grouped into the Spark Partitions of a CassandraRDD]
Spark Partitions Are Annotated with the Location of the TokenRanges They Span
The Driver waits spark.locality.wait for the preferred location to have an open executor, then assigns the task.
[Diagram: the Spark Driver assigns a Metropolis-annotated task to the executor on Metropolis (Superman); Gotham (Batman) and Coast City (Green L.) host their own executors]
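The waiting behavior is Spark's delay scheduling, which the spark.locality.wait setting controls (3 seconds by default). The sketch below is a toy model of the decision, not Spark's internal scheduler; the function name and signature are illustrative.

```python
# Toy sketch of delay scheduling: prefer an executor on one of the
# partition's preferred hosts, and fall back to any free executor only
# after waiting spark.locality.wait (default 3s) for one to open up.
def assign(preferred_hosts, free_executors, waited_secs, locality_wait=3.0):
    for host in preferred_hosts:
        if host in free_executors:
            return host, "NODE_LOCAL"
    if waited_secs >= locality_wait and free_executors:
        return sorted(free_executors)[0], "ANY"   # give up on locality
    return None, "WAITING"

print(assign(["Metropolis"], {"Metropolis", "Gotham"}, waited_secs=0))
# ('Metropolis', 'NODE_LOCAL')
```

Raising the wait trades scheduling latency for more node-local reads; setting it to 0 disables the preference entirely.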
The Spark Executor Uses the Java Driver to Pull Rows from the Local Cassandra Instance
On the Executor, the task (covering Tokens 780 - 830) is transformed into CQL queries, which are executed via the Java Driver:

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
The C* Java Driver pages spark.cassandra.input.page.row.size CQL rows at a time
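The effect of that paging can be sketched with a simple generator. This is only a model of the behavior: the real paging happens inside the DataStax Java Driver as it streams a token range's result set.

```python
# Toy sketch of driver-side paging: rows for a token range are fetched
# page_size rows at a time (spark.cassandra.input.page.row.size), so the
# executor iterates the range without holding it all in memory at once.
def paged(rows, page_size):
    for i in range(0, len(rows), page_size):
        yield rows[i:i + page_size]

pages = list(paged(list(range(10)), page_size=4))
print(pages)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```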
Because we are utilizing CQL, we can also push down any predicates that Cassandra itself can handle:

SELECT * FROM keyspace.table WHERE (Pushed Down Clauses) AND token(pk) > 780 AND token(pk) <= 830
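Assembling such a query can be sketched as string composition: pushed-down clauses ANDed with the task's token bounds. The connector builds its queries in Scala; the function and clause below are illustrative only.

```python
# Toy sketch of assembling a task's query: predicates Cassandra can
# evaluate (e.g. secondary-index clauses) are ANDed with the token bounds.
def build_query(table, pushed_clauses, start_token, end_token):
    clauses = list(pushed_clauses) + [
        f"token(pk) > {start_token}",
        f"token(pk) <= {end_token}",
    ]
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)

print(build_query("keyspace.table", ["color = 'Red'"], 780, 830))
# SELECT * FROM keyspace.table WHERE color = 'Red' AND token(pk) > 780 AND token(pk) <= 830
```

Predicates Cassandra cannot evaluate are left for Spark to apply after the rows arrive.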
Loading Sizable But Defined Amounts of Data
Retrieving sets of Partition Keys can be done in parallel
joinWithCassandraTable Provides an Interface for Obtaining a Set of C* Partitions
Generic RDDs can be joined, but the Spark Tasks will not be Node Local.
[Diagram: a Generic RDD whose partitions pull from Metropolis (Superman), Gotham (Batham), and Coast City (Green L.) regardless of where they run]
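The join's per-partition behavior can be sketched as a lookup per key. This is a toy model: in the connector each key becomes a single-partition-key CQL read (SELECT ... WHERE pk = ?), and the dict below merely stands in for that read.

```python
# Toy sketch of joinWithCassandraTable within one Spark partition: each key
# is looked up against the C* table; keys with no matching row drop out.
def join_with_table(keys, lookup):
    return [(k, lookup[k]) for k in keys if k in lookup]

table = {"Jacek": (514, "Red"), "Clark": (1, "Blue")}
print(join_with_table(["Jacek", "Diana"], table))   # [('Jacek', (514, 'Red'))]
```

On a generic RDD, nothing guarantees the executor running a partition is a replica for the keys it looks up, so these reads cross the network.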
repartitionByCassandraReplica Repartitions RDDs to Be C* Local
[Diagram: a Generic RDD becomes a CassandraPartitionedRDD]
This operation requires a shuffle.
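The repartitioning can be sketched as grouping elements by the replica that owns each key's token. A toy model: replica_for stands in for the token-range lookup the connector performs, and in real Spark the grouping is the shuffle that moves each element to an executor on that node.

```python
# Toy sketch of repartitionByCassandraReplica: key every element by the
# node owning its partition key's token, so each group can be scheduled
# on an executor co-located with that replica.
from collections import defaultdict

def repartition_by_replica(keys, replica_for):
    partitions = defaultdict(list)
    for key in keys:
        partitions[replica_for(key)].append(key)   # the shuffle moves key here
    return dict(partitions)

ring = {"Jacek": "Metropolis", "Clark": "Metropolis", "Bruce": "Gotham"}
print(repartition_by_replica(["Jacek", "Bruce", "Clark"], ring.get))
# {'Metropolis': ['Jacek', 'Clark'], 'Gotham': ['Bruce']}
```

The shuffle is paid once, after which subsequent joins against Cassandra are node local.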
joinWithCassandraTable on CassandraPartitionedRDDs (or CassandraTableScanRDDs) will be Node Local
CassandraPartitionedRDDs are partitioned to be executed node local.
[Diagram: each partition is joined on the executor co-located with its replica: Metropolis (Superman), Gotham (Batman), Coast City (Green L.)]
For the join, the Spark Executor again uses the Java Driver to pull rows from the local Cassandra instance, paging spark.cassandra.input.page.row.size CQL rows at a time. Each partition key becomes its own query:

SELECT * FROM keyspace.table WHERE pk = …
DataStax Enterprise Comes Bundled with Spark and the Connector
DataStax Delivers Apache Cassandra in a Database Platform, bundled with Apache Spark and Apache Solr.
DataStax Enterprise Enables This Same Machinery with Solr Pushdown
On a DataStax Enterprise node, the executor's query for Tokens 780 - 830 can include a Solr predicate alongside the token bounds:

SELECT * FROM keyspace.table WHERE solr_query = 'title:b' AND token(pk) > 780 AND token(pk) <= 830
Learn More Online and at Cassandra Summit: https://academy.datastax.com/