Escape From Hadoop: Spark One Liners for C* Ops

Posted on 26-Jun-2015


DESCRIPTION

Apache Cassandra and Spark, when combined, can give powerful OLTP and OLAP functionality for your data. We'll walk through the basics of both of these platforms before diving into applications combining the two. Joins, changing a partition key, or importing data are usually difficult in Cassandra, but we'll see how to do these and other operations in a set of simple Spark Shell one-liners!

TRANSCRIPT

Escape From Hadoop: Spark One Liners for C* Ops

Kurt Russell Spitzer DataStax

Who am I?

• Bioinformatics Ph.D. from UCSF

• Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!!

• Spends a lot of time spinning up clusters on EC2, GCE, Azure, …
  http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time

• Developing new ways to make sure that C* Scales

Why escape from Hadoop?

HADOOP

Many Moving Pieces

Map Reduce

Lots of Overhead

Single Points of Failure

And there is a way out!

Spark Provides a Simple and Efficient framework for Distributed Computations

Node Roles: 2
In Memory Caching: Yes!
Generic DAG Execution: Yes!
Great Abstraction For Datasets? RDD!

[Diagram: a Spark Master coordinating several Spark Workers; each worker runs a Spark Executor over partitions of a Resilient Distributed Dataset]
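As a rough sketch of what the RDD abstraction buys you (hypothetical data, plain Scala collections standing in for a real RDD so it runs without a cluster; in the Spark shell you would start from sc.parallelize or sc.cassandraTable instead):

```scala
// A plain Scala stand-in for an RDD pipeline. The transformation style
// (filter/map chains ending in an action) is the same one the Spark
// one-liners below use on distributed data.
object RddSketch {
  def main(args: Array[String]): Unit = {
    val locations = List("White House", "Air Force 1", "NYC", "NYC")

    val nycVisits = locations
      .filter(_ == "NYC") // transformation: keep only NYC rows
      .map(_ => 1)        // transformation: one count per row
      .sum                // action: collapse to a single result

    println(nycVisits)    // 2
  }
}
```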

Spark is Compatible with HDFS, Parquet, CSVs, …

AND

APACHE CASSANDRA

Apache Cassandra is a Linearly Scaling and Fault Tolerant noSQL Database

Linearly Scaling: the power of the database increases linearly with the number of machines. 2x machines = 2x throughput.

http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Fault Tolerant:
Nodes down != Database Down
Datacenter down != Database Down

Apache Cassandra Architecture is Very Simple

Node Roles: 1
Replication: Tunable
Consistency: Tunable

[Diagram: a client talking to a ring of C* nodes, with replication between the nodes]

DataStax OSS Connector Spark to Cassandra

https://github.com/datastax/spark-cassandra-connector

Cassandra                  Spark
Keyspace / Table     →     RDD[CassandraRow]
                           RDD[Tuples]

Bundled and Supported with DSE 4.5!

Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C*

Each Executor maintains a connection to the C* Cluster.

[Diagram: the full token range divided among Spark Executors via the DataStax Java Driver: Tokens 1-1000, Tokens 1001-2000, Tokens …]

RDDs read into different splits based on sets of tokens.
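The split idea can be pictured with a toy sketch (hypothetical range and split size, plain Scala; the real connector derives its splits from the cluster's token metadata, and Cassandra tokens are 64-bit values, not 1-3000):

```scala
// Toy model of carving a token range into splits, one split per Spark task.
object TokenSplits {
  def main(args: Array[String]): Unit = {
    val fullRange = 1 to 3000
    val splitSize = 1000

    // grouped(...) yields consecutive chunks: 1-1000, 1001-2000, 2001-3000,
    // mirroring the per-executor token ranges in the diagram above.
    val splits = fullRange.grouped(splitSize).map(r => (r.head, r.last)).toList

    println(splits) // List((1,1000), (1001,2000), (2001,3000))
  }
}
```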

Co-locate Spark and C* for Best Performance

Running Spark Workers on the same nodes as your C* Cluster will save network hops when reading and writing.

Setting up C* and Spark

DSE >= 4.5.0: just start your nodes with

dse cassandra -k

Apache Cassandra: follow the excellent guide by Al Tobey

http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html

We need a Distributed System For Analytics and Batch Jobs

But it doesn’t have to be complicated!

Even count needs to be distributed.

You could make this easier by adding yet another technology to your Hadoop stack (Hive, Pig, Impala), or we could just do one-liners in the Spark shell.

Ask me to write a Map Reduce for word count, I dare you.

Basics: Getting a Table and Counting

CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
use newyork;
CREATE TABLE presidentlocations (
    time int,
    location text,
    PRIMARY KEY (time)
);
INSERT INTO presidentlocations (time, location) VALUES ( 1, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 2, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 3, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 4, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 5, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 6, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 7, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 8, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 9, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 10, 'NYC' );

scala> sc.cassandraTable("newyork","presidentlocations").count
res3: Long = 10

Basics: take() and toArray

scala> sc.cassandraTable("newyork","presidentlocations").take(1)
res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC})

take(1) returns an Array of CassandraRows.

scala> sc.cassandraTable("newyork","presidentlocations").toArray
res3: Array[com.datastax.spark.connector.CassandraRow] = Array(
    CassandraRow{time: 9, location: NYC},
    CassandraRow{time: 3, location: White House},
    …,
    CassandraRow{time: 6, location: Air Force 1})

Basics: Getting Row Values out of a CassandraRow

scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time")
res5: Int = 9

get[Int]  get[String]  …  get[Any]
Got Null? get[Option[Int]]

http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
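The Option variant behaves like standard Scala Option handling; a small stand-in sketch (the row is modeled here as a hypothetical Map, since CassandraRow itself needs the connector on the classpath):

```scala
// Model a row with a null column as a Map[String, Option[Int]].
// get[Int] on a null column would throw; get[Option[Int]] gives None instead.
object OptionGet {
  def main(args: Array[String]): Unit = {
    val row: Map[String, Option[Int]] = Map("time" -> Some(9), "age" -> None)

    val time: Option[Int] = row("time") // Some(9): the column had a value
    val age: Option[Int]  = row("age")  // None: the column was null

    println(time.getOrElse(-1)) // 9
    println(age.getOrElse(-1))  // -1
  }
}
```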

Copy A Table

Say we want to restructure our table or add a new column?

CREATE TABLE characterlocations (
    time int,
    character text,
    location text,
    PRIMARY KEY (time, character)
);

sc.cassandraTable("newyork","presidentlocations")
    .map( row => (
        row.get[Int]("time"),
        "president",
        row.get[String]("location")
    )).saveToCassandra("newyork","characterlocations")

cqlsh:newyork> SELECT * FROM characterlocations;

 time | character | location
------+-----------+-------------
    5 | president | Air Force 1
   10 | president |         NYC
  …

Filter a Table

What if we want to filter based on a non-clustering key column?

scala> sc.cassandraTable("newyork","presidentlocations")
    .filter( _.get[Int]("time") > 7 )
    .toArray
res9: Array[com.datastax.spark.connector.CassandraRow] = Array(
    CassandraRow{time: 9, location: NYC},
    CassandraRow{time: 10, location: NYC},
    CassandraRow{time: 8, location: NYC} )

The filter runs in Spark on each row (the _ is the anonymous row parameter), so it works on any column.

Backfill a Table with a Different Key!

If we actually want to have quick access to timelines we need a C* table with a different structure.

CREATE TABLE timelines (
    time int,
    character text,
    location text,
    PRIMARY KEY ((character), time)
);

sc.cassandraTable("newyork","characterlocations")
    .saveToCassandra("newyork","timelines")

cqlsh:newyork> select * from timelines;

 character | time | location
-----------+------+-------------
 president |    1 | White House
 president |    2 | White House
 president |    3 | White House
 president |    4 | White House
 president |    5 | Air Force 1
 president |    6 | Air Force 1
 president |    7 | Air Force 1
 president |    8 |         NYC
 president |    9 |         NYC
 president |   10 |         NYC

Import a CSV

I have some data in another source which I could really use in my Cassandra table.

sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv")
    .map(_.split(","))
    .map( line =>
        (line(0),line(1),line(2)))
    .saveToCassandra("newyork","timelines")

cqlsh:newyork> select * from timelines where character = 'plissken';

 character | time | location
-----------+------+-----------------
  plissken |    1 | Federal Reserve
  plissken |    2 | Federal Reserve
  plissken |    3 | Federal Reserve
  plissken |    4 |           Court
  plissken |    5 |           Court
  plissken |    6 |           Court
  plissken |    7 |           Court
  plissken |    8 |  Stealth Glider
  plissken |    9 |             NYC
  plissken |   10 |             NYC

Perform a Join with MySQL

Maybe a little more than one line …

MySQL table "quotes" in "escape_from_ny":

import java.sql._
import org.apache.spark.rdd.JdbcRDD
Class.forName("com.mysql.jdbc.Driver").newInstance() // Connector/J added to Spark Shell classpath
val quotes = new JdbcRDD(
    sc,
    () => { DriverManager.getConnection("jdbc:mysql://localhost/escape_from_ny?user=root") },
    "SELECT * FROM quotes WHERE ? <= ID and ID <= ?",
    0,
    100,
    5,
    (r: ResultSet) => { (r.getInt(2), r.getString(3)) }
)

quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23

Both sides of the join need to be in the form of RDD[K,V]:

quotes.join(
    sc.cassandraTable("newyork","timelines")
        .filter( _.get[String]("character") == "plissken")
        .map( row => (row.get[Int]("time"), row.get[String]("location"))))
    .take(1)
    .foreach(println)

(5,
    (Bob Hauk: There was an accident. About an hour ago, a small jet went down inside New York City. The President was on board.
     Snake Plissken: The president of what?,
    Court)
)
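The join pairs elements by key exactly as an inner join on two key-value collections would; a minimal stand-in (hypothetical data, plain Scala maps instead of RDDs):

```scala
// Two (K, V) datasets keyed by time, standing in for the MySQL quotes
// and the C* timeline. RDD.join on RDD[(K, V)] and RDD[(K, W)] keeps only
// keys present on both sides and pairs the values, like this inner join.
object KvJoin {
  def main(args: Array[String]): Unit = {
    val quotes    = Map(5 -> "Bob Hauk: ...")
    val locations = Map(5 -> "Court", 6 -> "Court")

    val joined = for {
      (k, quote) <- quotes
      loc        <- locations.get(k) // drops keys missing on the other side
    } yield (k, (quote, loc))

    joined.foreach(println) // (5,(Bob Hauk: ...,Court))
  }
}
```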

Easy Objects with Case Classes

We have the technology to make this even easier!

case class timelineRow(character:String, time:Int, location:String)
sc.cassandraTable[timelineRow]("newyork","timelines")
    .filter( _.character == "plissken")
    .filter( _.time == 8)
    .toArray
res13: Array[timelineRow] = Array(timelineRow(plissken,8,Stealth Glider))

The connector maps each row onto the case class fields: character, time, location.

The Future

A Map Reduce for Word Count …

scala> sc.cassandraTable("newyork","presidentlocations")
    .map( _.get[String]("location") )
    .flatMap( _.split(" "))
    .map( (_,1))
    .reduceByKey( _ + _ )
    .toArray
res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))
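The same pipeline can be sanity-checked locally on plain Scala collections (using the deck's ten location values; groupBy plus a sum stands in for the distributed reduceByKey):

```scala
// Local word count mirroring the Spark one-liner above:
// flatMap into words, map to (word, 1), then reduce counts by key.
object WordCount {
  def main(args: Array[String]): Unit = {
    val locations =
      List.fill(4)("White House") ++ List.fill(3)("Air Force 1") ++ List.fill(3)("NYC")

    val counts = locations
      .flatMap(_.split(" "))                                    // words
      .map((_, 1))                                              // (word, 1)
      .groupBy(_._1)                                            // stand-in for reduceByKey
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    println(counts) // same counts as res17: White->4, House->4, Air->3, Force->3, 1->3, NYC->3
  }
}
```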

Stand Alone App Example
https://github.com/RussellSpitzer/spark-cassandra-csv

[Diagram: a CSV (Car, Model, Color: Dodge, Caravan, Red; Ford, F150, Black; Toyota, Prius, Green) is read by Spark + SCC into an RDD[CassandraRow] and written via a column mapping to the FavoriteCars table in Cassandra]
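A sketch of the CSV-to-table flow the app implements (the parsing step only, with a hypothetical FavoriteCar case class; the real app reads with sc.textFile and writes with the connector's saveToCassandra and a column mapping):

```scala
// Parse the favorite-cars CSV into typed rows. With the connector,
// saveToCassandra would map the case class fields onto the table's columns.
case class FavoriteCar(car: String, model: String, color: String)

object CsvToRows {
  def main(args: Array[String]): Unit = {
    val csv = List(
      "Dodge, Caravan, Red",
      "Ford, F150, Black",
      "Toyota, Prius, Green")

    val rows = csv.map { line =>
      val fields = line.split(",").map(_.trim) // split columns, drop padding
      FavoriteCar(fields(0), fields(1), fields(2))
    }

    rows.foreach(println) // FavoriteCar(Dodge,Caravan,Red) ...
  }
}
```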

Thanks for listening!

Questions?

There is plenty more we can do with Spark but …

Thanks for coming to the meetup!!

DataStax Academy offers free online Cassandra training!

Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth language and migration pages!

Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org!

Getting started with Cassandra? In production? Need help? Get questions answered with Planet Cassandra's free virtual office hours running weekly!

Email us: Community@DataStax.com
Tweet us: @PlanetCassandra

Thanks for your Time and Come to C* Summit!

Cassandra Summit
SEPTEMBER 10 - 11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS HOTEL
