Escape From Hadoop: Spark One Liners for C* Ops

Kurt Russell Spitzer, DataStax

DESCRIPTION

Apache Cassandra and Spark, when combined, provide powerful OLTP and OLAP functionality for your data. We’ll walk through the basics of both platforms before diving into applications that combine the two. Joins, changing a partition key, and importing data are usually difficult in Cassandra, but we’ll see how to do these and other operations as a set of simple Spark shell one-liners!

TRANSCRIPT

Page 1: Escape From Hadoop: Spark One Liners for C* Ops

Escape From Hadoop: Spark One Liners for C* Ops

Kurt Russell Spitzer DataStax

Page 2: Escape From Hadoop: Spark One Liners for C* Ops

Who am I?

• Bioinformatics Ph.D. from UCSF

• Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!!

• Spends a lot of time spinning up clusters on EC2, GCE, Azure, …
  http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time

• Develops new ways to make sure that C* scales

Page 3: Escape From Hadoop: Spark One Liners for C* Ops

Why escape from Hadoop?

Hadoop has many moving pieces, a MapReduce-only programming model, lots of overhead, and single points of failure.

And there is a way out!

Page 4: Escape From Hadoop: Spark One Liners for C* Ops

Spark provides a simple and efficient framework for distributed computations.

Node Roles: 2 (Master and Worker)
In-Memory Caching: Yes!
Generic DAG Execution: Yes!
Great abstraction for datasets? RDD!

[Diagram: a Spark Master coordinating several Spark Workers; each Worker hosts a Spark Executor operating on a Resilient Distributed Dataset.]

Page 5: Escape From Hadoop: Spark One Liners for C* Ops

Spark is Compatible with HDFS, Parquet, CSVs, ….

AND

APACHE CASSANDRA

Page 7: Escape From Hadoop: Spark One Liners for C* Ops

Apache Cassandra is a linearly scaling and fault-tolerant NoSQL database.

Linearly scaling: the power of the database increases linearly with the number of machines. 2x machines = 2x throughput.

http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Fault tolerant: nodes down != database down; datacenter down != database down.

Page 8: Escape From Hadoop: Spark One Liners for C* Ops

Apache Cassandra's architecture is very simple.

Node Roles: 1
Replication: Tunable
Consistency: Tunable

[Diagram: a client talking to a ring of C* nodes that replicate to one another.]

Page 9: Escape From Hadoop: Spark One Liners for C* Ops

DataStax OSS Connector: Spark to Cassandra

https://github.com/datastax/spark-cassandra-connector

The connector maps Cassandra keyspaces and tables into Spark as RDD[CassandraRow] or RDD[Tuples].

Bundled and supported with DSE 4.5!

Page 10: Escape From Hadoop: Spark One Liners for C* Ops

The Spark Cassandra Connector uses the DataStax Java Driver to read from and write to C*.

Each Spark Executor maintains a connection to the C* cluster through the DataStax Java Driver. The full token range is divided into sets of tokens (Tokens 1-1000, Tokens 1001-2000, …), and RDDs are read into different splits based on those token sets.

Page 11: Escape From Hadoop: Spark One Liners for C* Ops

Co-locate Spark and C* for Best Performance

Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.

Page 12: Escape From Hadoop: Spark One Liners for C* Ops

Setting up C* and Spark

DSE 4.5.0 and later: just start your nodes with

dse cassandra -k

Apache Cassandra: follow the excellent guide by Al Tobey

http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
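Either way, sc.cassandraTable won't resolve until the connector is on the shell's classpath and its implicits are imported. A minimal sketch for open-source Spark (the jar name, version, host, and exact flags are assumptions and depend on your Spark version; dse spark handles all of this for you on DSE):

$ spark-shell --jars spark-cassandra-connector-assembly-1.0.0.jar \
    --conf spark.cassandra.connection.host=127.0.0.1

scala> import com.datastax.spark.connector._   // adds cassandraTable/saveToCassandra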

Page 13: Escape From Hadoop: Spark One Liners for C* Ops

We need a Distributed System For Analytics and Batch Jobs

But it doesn’t have to be complicated!

Page 14: Escape From Hadoop: Spark One Liners for C* Ops

Even count needs to be distributed.

You could make this easier by adding yet another technology to your Hadoop stack (Hive, Pig, Impala), or we could just do one-liners in the Spark shell.

Ask me to write a MapReduce job for word count. I dare you.

Page 15: Escape From Hadoop: Spark One Liners for C* Ops

Basics: Getting a Table and Counting

CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
USE newyork;
CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) );
INSERT INTO presidentlocations (time, location) VALUES ( 1, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 2, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 3, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 4, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 5, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 6, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 7, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 8, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 9, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 10, 'NYC' );

scala> sc.cassandraTable("newyork","presidentlocations").count
res3: Long = 10
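A side note: later connector releases also add a cassandraCount() that pushes the counting down to Cassandra instead of pulling rows into Spark first. A sketch, assuming a connector version that ships it:

scala> sc.cassandraTable("newyork","presidentlocations").cassandraCount()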

Page 18: Escape From Hadoop: Spark One Liners for C* Ops

Basics: take() and toArray

scala> sc.cassandraTable("newyork","presidentlocations").take(1)
res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC})

take(1) returns an Array of CassandraRows with a single element.

scala> sc.cassandraTable("newyork","presidentlocations").toArray
res3: Array[com.datastax.spark.connector.CassandraRow] = Array(
    CassandraRow{time: 9, location: NYC},
    CassandraRow{time: 3, location: White House},
    …,
    CassandraRow{time: 6, location: Air Force 1})

toArray pulls the whole table back as an Array of CassandraRows.

Page 22: Escape From Hadoop: Spark One Liners for C* Ops

Basics: Getting Row Values out of a CassandraRow

scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time")
res5: Int = 9

get[Int], get[String], … get[Any]

Got null? get[Option[Int]]

http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
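For columns that may hold a null, asking for an Option hands back Some or None instead of throwing. A quick sketch (the output shown is illustrative):

scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Option[String]]("location")
res6: Option[String] = Some(NYC)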

Page 25: Escape From Hadoop: Spark One Liners for C* Ops

Copy A Table

Say we want to restructure our table or add a new column?

CREATE TABLE characterlocations (
  time int,
  character text,
  location text,
  PRIMARY KEY (time, character)
);

sc.cassandraTable("newyork","presidentlocations")
  .map( row => (
    row.get[Int]("time"),
    "president",
    row.get[String]("location")
  )).saveToCassandra("newyork","characterlocations")

Each CassandraRow is mapped to a (time, character, location) tuple, and saveToCassandra writes the tuples into the new table.

cqlsh:newyork> SELECT * FROM characterlocations;

 time | character | location
------+-----------+-------------
    5 | president | Air Force 1
   10 | president |         NYC
 …
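The tuple fields are written to the table's columns positionally; to be explicit, or to write only a subset of columns, you can name them with SomeColumns. A sketch:

sc.cassandraTable("newyork","presidentlocations")
  .map( row => (row.get[Int]("time"), "president", row.get[String]("location")) )
  .saveToCassandra("newyork","characterlocations",
    SomeColumns("time","character","location"))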

Page 31: Escape From Hadoop: Spark One Liners for C* Ops

Filter a Table

scala> sc.cassandraTable("newyork","presidentlocations")
  .filter( _.get[Int]("time") > 7 )
  .toArray
res9: Array[com.datastax.spark.connector.CassandraRow] = Array(
    CassandraRow{time: 9, location: NYC},
    CassandraRow{time: 10, location: NYC},
    CassandraRow{time: 8, location: NYC} )

The filter runs in Spark: for each row, the anonymous parameter _ has its time column extracted with get[Int] and compared against 7.

What if we want to filter based on a non-clustering key column?
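The Spark-side filter above answers that: the predicate is ordinary Scala, so it works on any column. When the predicate is on a clustering column, the connector can also push it into CQL with where(), running the filter inside Cassandra instead of in Spark. A sketch against the timelines table defined on the next page (pushdown rules depend on your schema and Cassandra version):

scala> sc.cassandraTable("newyork","timelines").where("time > 7").toArray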

Page 37: Escape From Hadoop: Spark One Liners for C* Ops

Backfill a Table with a Different Key!

If we actually want quick access to timelines, we need a C* table with a different structure.

CREATE TABLE timelines (
  time int,
  character text,
  location text,
  PRIMARY KEY ((character), time)
);

sc.cassandraTable("newyork","characterlocations")
  .saveToCassandra("newyork","timelines")

cqlsh:newyork> SELECT * FROM timelines;

 character | time | location
-----------+------+-------------
 president |    1 | White House
 president |    2 | White House
 president |    3 | White House
 president |    4 | White House
 president |    5 | Air Force 1
 president |    6 | Air Force 1
 president |    7 | Air Force 1
 president |    8 |         NYC
 president |    9 |         NYC
 president |   10 |         NYC

Page 41: Escape From Hadoop: Spark One Liners for C* Ops

Import a CSV

I have some data in another source which I could really use in my Cassandra table.

sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv")
  .map(_.split(","))
  .map( line =>
    (line(0),line(1),line(2)))
  .saveToCassandra("newyork","timelines")

textFile reads the file, each line is split on commas, and the resulting (character, time, location) tuples are saved to Cassandra.

cqlsh:newyork> SELECT * FROM timelines WHERE character = 'plissken';

 character | time | location
-----------+------+-----------------
  plissken |    1 | Federal Reserve
  plissken |    2 | Federal Reserve
  plissken |    3 | Federal Reserve
  plissken |    4 |           Court
  plissken |    5 |           Court
  plissken |    6 |           Court
  plissken |    7 |           Court
  plissken |    8 |  Stealth Glider
  plissken |    9 |             NYC
  plissken |   10 |             NYC
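split leaves every field a String; the connector converts to the column types where it can, but converting explicitly in the map is safer. A sketch of the same import with the time field parsed as an Int:

sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv")
  .map(_.split(","))
  .map( line => (line(0), line(1).toInt, line(2)) )  // character, time, location
  .saveToCassandra("newyork","timelines")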

Page 47: Escape From Hadoop: Spark One Liners for C* Ops

Perform a Join with MySQL

Maybe a little more than one line …

MySQL table "quotes" in "escape_from_ny":

import java.sql._
import org.apache.spark.rdd.JdbcRDD
Class.forName("com.mysql.jdbc.Driver").newInstance() // Connector/J added to the Spark Shell classpath
val quotes = new JdbcRDD(
  sc,
  () => { DriverManager.getConnection("jdbc:mysql://localhost/escape_from_ny?user=root") },
  "SELECT * FROM quotes WHERE ? <= ID and ID <= ?", // the ?s bind the partition bounds
  0,   // lower bound
  100, // upper bound
  5,   // number of partitions
  (r: ResultSet) => { (r.getInt(2), r.getString(3)) } // map each row to (time, quote)
)

quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23

Page 48: Escape From Hadoop: Spark One Liners for C* Ops

Perform a Join with MySQL

Maybe a little more than one line …

join needs both sides in the form of RDD[K,V], so the Cassandra rows are keyed by time to match the JdbcRDD's (Int, String) pairs.

quotes.join(
  sc.cassandraTable("newyork","timelines")
    .filter( _.get[String]("character") == "plissken")
    .map( row => (row.get[Int]("time"), row.get[String]("location"))))
  .take(1)
  .foreach(println)

(5,
  (Bob Hauk: There was an accident. About an hour ago, a small jet went down inside New York City. The President was on board.
   Snake Plissken: The president of what?,
  Court)
)

Page 52: Escape From Hadoop: Spark One Liners for C* Ops

Easy Objects with Case Classes

We have the technology to make this even easier!

case class timelineRow (character: String, time: Int, location: String)

sc.cassandraTable[timelineRow]("newyork","timelines")
  .filter( _.character == "plissken")
  .filter( _.time == 8)
  .toArray
res13: Array[timelineRow] = Array(timelineRow(plissken,8,Stealth Glider))

The table's character, time, and location columns are mapped onto the case class fields by name, so the filters can use plain field access instead of get[...].
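Case classes work for writes as well: fields are matched to column names on the way out too, so an RDD of timelineRow can be saved straight back. A sketch (the new row is invented for illustration):

sc.parallelize(Seq(timelineRow("plissken", 11, "Helicopter")))
  .saveToCassandra("newyork","timelines")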

Page 58: Escape From Hadoop: Spark One Liners for C* Ops

A Map Reduce for Word Count …

scala> sc.cassandraTable("newyork","presidentlocations")
  .map( _.get[String]("location") )
  .flatMap( _.split(" "))
  .map( (_,1))
  .reduceByKey( _ + _ )
  .toArray
res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))

Each location is pulled out as a String, split into words, each word mapped to a (word, 1) pair, and reduceByKey sums the counts per word.
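The result is just an RDD of (String, Int) pairs, so it can go straight back into Cassandra. A sketch, assuming a hypothetical wordcount table:

cqlsh:newyork> CREATE TABLE wordcount ( word text PRIMARY KEY, count int );

scala> sc.cassandraTable("newyork","presidentlocations")
  .map( _.get[String]("location") )
  .flatMap( _.split(" "))
  .map( (_,1))
  .reduceByKey( _ + _ )
  .saveToCassandra("newyork","wordcount", SomeColumns("word","count"))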

Page 64: Escape From Hadoop: Spark One Liners for C* Ops

Stand Alone App Example

https://github.com/RussellSpitzer/spark-cassandra-csv

A CSV of favorite cars:

Car, Model, Color
Dodge, Caravan, Red
Ford, F150, Black
Toyota, Prius, Green

[Diagram: Spark and the Spark Cassandra Connector read the CSV into an RDD, apply a column mapping, and write the rows to the FavoriteCars table in Cassandra.]
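The repo has the full project; the skeleton of such an app looks roughly like this (a sketch — the class name, host, input path, and target keyspace/table are assumptions, not the repo's actual code):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CsvToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CsvToCassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1") // any C* node
    val sc = new SparkContext(conf)
    sc.textFile("favoritecars.csv")               // hypothetical input path
      .map(_.split(",").map(_.trim))
      .map(cols => (cols(0), cols(1), cols(2)))   // car, model, color
      .saveToCassandra("newyork", "favoritecars", // hypothetical keyspace/table
        SomeColumns("car", "model", "color"))
    sc.stop()
  }
}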

Page 65: Escape From Hadoop: Spark One Liners for C* Ops

There is plenty more we can do with Spark, but …

Thanks for listening!

Questions?

Page 66: Escape From Hadoop: Spark One Liners for C* Ops

Thanks for coming to the meetup!

DataStax Academy offers free online Cassandra training!

Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth language and migration pages!

Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org!

Getting started with Cassandra? In production? Need help? Get questions answered with Planet Cassandra's free virtual office hours, running weekly!

Email us: [email protected]
Tweet us: @PlanetCassandra

Page 67: Escape From Hadoop: Spark One Liners for C* Ops

Thanks for your Time and Come to C* Summit!

Cassandra Summit Link

SEPTEMBER 10-11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS HOTEL