Posted on 26-Jun-2015
Escape From Hadoop: Spark One Liners for C* Ops
Kurt Russell Spitzer DataStax
Who am I?
• Bioinformatics Ph.D. from UCSF
• Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!!
• Spends a lot of time spinning up clusters on EC2, GCE, Azure, …
http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
• Developing new ways to make sure that C* Scales
Why escape from Hadoop?
HADOOP
• Many Moving Pieces
• Map Reduce
• Lots of Overhead
• Single Points of Failure

And there is a way out!
Spark Provides a Simple and Efficient framework for Distributed Computations
• Node Roles: 2
• In Memory Caching: Yes!
• Generic DAG Execution: Yes!
• Great Abstraction For Datasets? RDD!

A Spark Master coordinates Spark Workers, and each Spark Executor holds partitions of a Resilient Distributed Dataset.
Spark is Compatible with HDFS, Parquet, CSVs, ….
AND
APACHE CASSANDRA
Apache Cassandra is a Linearly Scaling and Fault Tolerant NoSQL Database
Linearly Scaling: the power of the database increases linearly with the number of machines. 2x machines = 2x throughput.
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Fault Tolerant: Nodes down != Database Down. Datacenter down != Database Down.
Apache Cassandra Architecture is Very Simple
• Replication: Tunable
• Consistency: Tunable
• Node Roles: 1

A client can connect to any C* node in the ring.
DataStax OSS Connector Spark to Cassandra
https://github.com/datastax/spark-cassandra-connector
Cassandra → Spark: a Keyspace/Table is exposed as an RDD[CassandraRow] or an RDD of Tuples.
Bundled and Supported with DSE 4.5!
Spark Cassandra Connector uses the DataStax Java Driver to Read from and
Write to C*
Each Executor maintains a connection to the C* Cluster through the DataStax Java Driver. The full token range is divided into splits (Tokens 1-1000, Tokens 1001-2000, …), and RDDs are read into different splits based on sets of tokens.
Co-locate Spark and C* for Best Performance
Running Spark Workers on the same nodes as your C* Cluster will save network hops when reading and writing.
Setting up C* and Spark
DSE > 4.5.0: just start your nodes with

dse cassandra -k

Apache Cassandra: follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
We need a Distributed System For Analytics and Batch Jobs
But it doesn’t have to be complicated!
Even count needs to be distributed
You could make this easier by adding yet another technology to your Hadoop stack (Hive, Pig, Impala) …
… or we could just do one-liners in the Spark shell.
Ask me to write a Map Reduce for word count, I dare you.
Basics: Getting a Table and Counting

CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
use newyork;
CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) );
INSERT INTO presidentlocations (time, location) VALUES ( 1, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 2, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 3, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 4, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 5, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 6, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 7, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 8, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 9, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 10, 'NYC' );
scala> sc.cassandraTable("newyork","presidentlocations")
scala> sc.cassandraTable("newyork","presidentlocations").count
res3: Long = 10
Basics: take() and toArray

scala> sc.cassandraTable("newyork","presidentlocations").take(1)
res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC})

take(1) returns an Array of CassandraRows.
scala> sc.cassandraTable("newyork","presidentlocations").toArray
res3: Array[com.datastax.spark.connector.CassandraRow] = Array( CassandraRow{time: 9, location: NYC}, CassandraRow{time: 3, location: White House}, …, CassandraRow{time: 6, location: Air Force 1})
Basics: Getting Row Values out of a CassandraRow
scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time")
res5: Int = 9
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
Getters exist for many types: get[Int], get[String], … get[Any]
Got null? Use get[Option[Int]]
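When a column can hold nulls, a bare getter throws, while the Option getter makes the absence explicit. A sketch of what this looks like in the spark-shell (assuming the same newyork keyspace is loaded; a nullable location is hypothetical here):

```scala
// Sketch: get[Option[String]] wraps the value, so a null column
// comes back as None instead of throwing an exception.
scala> sc.cassandraTable("newyork","presidentlocations")
         .map(_.get[Option[String]]("location"))
         .take(1)
// A populated column arrives as Some(...); a null one as None.
```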
Copy A Table
Say we want to restructure our table or add a new column?

CREATE TABLE characterlocations (
  time int,
  character text,
  location text,
  PRIMARY KEY (time,character)
);

sc.cassandraTable("newyork","presidentlocations")
  .map( row => (
    row.get[Int]("time"),
    "president",
    row.get[String]("location")
  )).saveToCassandra("newyork","characterlocations")
cqlsh:newyork> SELECT * FROM characterlocations;

 time | character | location
------+-----------+-------------
    5 | president | Air Force 1
   10 | president | NYC
  …
Filter a Table

scala> sc.cassandraTable("newyork","presidentlocations")
         .filter( _.get[Int]("time") > 7 )
         .toArray

res9: Array[com.datastax.spark.connector.CassandraRow] = Array(
  CassandraRow{time: 9, location: NYC},
  CassandraRow{time: 10, location: NYC},
  CassandraRow{time: 8, location: NYC}
)

What if we want to filter based on a non-clustering key column?
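The Spark-side filter above pulls every row out of C* and discards the non-matching ones in the executor. For predicates on partition or clustering key columns, the connector also offers a server-side where that appends a CQL predicate to the read; a sketch, with exact predicate support depending on the connector version:

```scala
// Sketch: the predicate runs in Cassandra, so only matching rows
// cross the network. "time" is the clustering column of timelines.
sc.cassandraTable("newyork","timelines")
  .where("character = 'plissken' AND time > 7")
  .toArray
```

For non-key columns, the Spark-side filter remains the tool.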
Backfill a Table with a Different Key!

If we actually want to have quick access to timelines, we need a C* table with a different structure.

CREATE TABLE timelines (
  time int,
  character text,
  location text,
  PRIMARY KEY ((character), time)
);

sc.cassandraTable("newyork","characterlocations")
  .saveToCassandra("newyork","timelines")
cqlsh:newyork> select * from timelines;

 character | time | location
-----------+------+-------------
 president |    1 | White House
 president |    2 | White House
 president |    3 | White House
 president |    4 | White House
 president |    5 | Air Force 1
 president |    6 | Air Force 1
 president |    7 | Air Force 1
 president |    8 | NYC
 president |    9 | NYC
 president |   10 | NYC
Import a CSV

I have some data in another source which I could really use in my Cassandra table.

sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv")
  .map(_.split(","))
  .map( line => (line(0),line(1),line(2)))
  .saveToCassandra("newyork","timelines")
cqlsh:newyork> select * from timelines where character = 'plissken';

 character | time | location
-----------+------+-----------------
 plissken  |    1 | Federal Reserve
 plissken  |    2 | Federal Reserve
 plissken  |    3 | Federal Reserve
 plissken  |    4 | Court
 plissken  |    5 | Court
 plissken  |    6 | Court
 plissken  |    7 | Court
 plissken  |    8 | Stealth Glider
 plissken  |    9 | NYC
 plissken  |   10 | NYC
Perform a Join with MySQL
Maybe a little more than one line …

MySQL Table "quotes" in "escape_from_ny"

import java.sql._
import org.apache.spark.rdd.JdbcRDD
Class.forName("com.mysql.jdbc.Driver").newInstance() // Connector/J added to Spark Shell classpath
val quotes = new JdbcRDD(
  sc,
  () => { DriverManager.getConnection("jdbc:mysql://localhost/escape_from_ny?user=root") },
  "SELECT * FROM quotes WHERE ? <= ID and ID <= ?",
  0, 100, 5,
  (r: ResultSet) => { (r.getInt(2), r.getString(3)) }
)

quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
quotes.join(
  sc.cassandraTable("newyork","timelines")
    .filter( _.get[String]("character") == "plissken")
    .map( row => (row.get[Int]("time"), row.get[String]("location"))))
  .take(1)
  .foreach(println)

(5,(Bob Hauk: There was an accident. About an hour ago, a small jet went down inside New York City. The President was on board. Snake Plissken: The president of what?,Court))

The C* side needs to be in the form of RDD[K,V] for the join.
Easy Objects with Case Classes

We have the technology to make this even easier!

case class timelineRow (character:String, time:Int, location:String)
sc.cassandraTable[timelineRow]("newyork","timelines")
  .filter( _.character == "plissken")
  .filter( _.time == 8)
  .toArray

res13: Array[timelineRow] = Array(timelineRow(plissken,8,Stealth Glider))
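The case class mapping works for writes as well as reads: an RDD of timelineRow instances can be saved back, with fields matched to columns by name. A sketch against the same cluster (the new row is invented for illustration):

```scala
// Sketch: case class fields (character, time, location) map to the
// timelines columns by name on write.
val extraRows = sc.parallelize(Seq(timelineRow("plissken", 11, "Hudson River")))
extraRows.saveToCassandra("newyork","timelines")
```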
A Map Reduce for Word Count …

scala> sc.cassandraTable("newyork","presidentlocations")
         .map( _.get[String]("location") )
         .flatMap( _.split(" ") )
         .map( (_,1) )
         .reduceByKey( _ + _ )
         .toArray

res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))

Step by step for one row: "white house" → split → white, house → (_,1) → (white,1), (house,1) → reduceByKey( _ + _ ) sums the counts per word, e.g. (house,2).
Stand Alone App Example
https://github.com/RussellSpitzer/spark-cassandra-csv

CSV input (Car, Model, Color):
Dodge, Caravan, Red
Ford, F150, Black
Toyota, Prius, Green

The Spark Cassandra Connector reads the CSV into an RDD, applies a column mapping, and saves the rows to the FavoriteCars table in Cassandra.
Thanks for listening!
Questions?
There is plenty more we can do with Spark but …
Thanks for coming to the meetup!!
Getting started with Cassandra? DataStax Academy offers free online Cassandra training! Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth language and migration pages!
In production? Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org!
Need help? Get questions answered with Planet Cassandra's free virtual office hours running weekly!
Email us: Community@DataStax.com
Tweet us: @PlanetCassandra
Thanks for your Time and Come to C* Summit!
Cassandra Summit Link
SEPTEMBER 10-11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS HOTEL