Reading Cassandra Meetup Feb 2015: Apache Spark


TRANSCRIPT

Page 1: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Christopher Batey
Technical Evangelist for Apache Cassandra

Cassandra Spark Integration

Page 2: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Agenda
• Spark intro
• Spark Cassandra connector
• Examples:
- Migrating from MySQL to Cassandra
- Cassandra schema migrations
- Import data from flat file into Cassandra
- Spark SQL on Cassandra
- Spark Streaming and Cassandra

Page 3: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Scalability & Performance
• Scalability
- No single point of failure
- No special nodes that become the bottleneck
- Work/data can be re-distributed
• Operational performance, i.e. single-digit ms
- Single node for query
- Single disk seek per query

Page 4: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Cassandra cannot join or aggregate

(Diagram: a client asking "Where do I go for the max?")

Page 5: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Denormalisation

Page 6: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

But but…
• Sometimes you don’t need answers in milliseconds
• Data models done wrong - how do I fix it?
• New requirements for old data?
• Ad-hoc operational queries
• Managers always want counts / maxes

Page 7: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Apache Spark
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative and streaming analysis
• In Memory Storage and Disk
• Integrates with Most File and Storage Options

Page 8: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Components

(Diagram: Shark or Spark SQL, Streaming, ML and Graph libraries running on top of Spark, the general execution engine; Cassandra compatible)

Page 9: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Spark architecture

Page 10: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

org.apache.spark.rdd.RDD
• Resilient Distributed Dataset (RDD)
• Created through transformations on data (map, filter…) or other RDDs
• Immutable
• Partitioned
• Reusable

Page 11: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

RDD Operations
• Transformations - similar to the Scala collections API
- Produce new RDDs
- filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
- Require materialisation of the records to generate a value
- collect: Array[T], count, fold, reduce…
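A minimal sketch of that split (not from the slides), assuming an existing SparkContext sc: transformations only describe new RDDs, and nothing runs until an action asks for a value.

import org.apache.spark.rdd.RDD

// Transformations are lazy: these lines just build up a lineage of new RDDs
val numbers: RDD[Int] = sc.parallelize(1 to 10)
val evens: RDD[Int] = numbers.filter(_ % 2 == 0)
val doubled: RDD[Int] = evens.map(_ * 2)

// Actions materialise the records and return a value to the driver
val total: Int = doubled.reduce(_ + _)        // 60
val asArray: Array[Int] = doubled.collect()   // Array(4, 8, 12, 16, 20)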

Page 12: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Word count

val file: RDD[String] = sc.textFile("hdfs://...")

val counts: RDD[(String, Int)] = file
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")

Page 13: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Spark shell
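This slide was presumably a live demo. As an illustrative sketch (not from the deck): the shell started with bin/spark-shell already provides a SparkContext bound to sc, so you can explore interactively.

val readme = sc.textFile("README.md")               // RDD[String]
val sparkLines = readme.filter(_.contains("Spark")) // transformation, lazy
sparkLines.count()                                  // action: runs the job and returns the count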

Page 14: Reading Cassandra Meetup Feb 2015: Apache Spark

Operator Graph: Optimisation and Fault Tolerance

(Diagram: RDDs A–F connected by map, filter, groupBy and join operators, split into Stages 1–3; the legend marks which partitions are cached and which boxes are RDDs)
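The point of the graph is that Spark tracks lineage per stage, can recompute lost partitions on failure, and can reuse cached partitions instead of recomputing a whole branch. A minimal caching sketch, not from the talk, assuming a SparkContext sc:

val logs = sc.textFile("hdfs://...")                       // path elided, as in the slides
val errors = logs.filter(_.contains("ERROR")).cache()      // keep these partitions in memory once computed

val errorCount = errors.count()                            // first action computes and caches
val timeoutCount = errors.filter(_.contains("timeout")).count()  // reuses the cached partitions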

Page 15: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Partitioning
• Large data sets from S3, HDFS, Cassandra etc.
• Split into small chunks called partitions
• Each operation is done locally on a partition before combining with other partitions
• So partitioning is important for data locality
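As a rough sketch (not from the slides) of what that looks like in code, assuming a SparkContext sc:

val data = sc.textFile("hdfs://...")      // number of partitions follows the input splits
println(data.partitions.length)

// each partition is summarised locally before the per-partition results are combined
val lineLengths = data.repartition(8).mapPartitions(lines => Iterator(lines.map(_.length).sum))
println(lineLengths.collect().toSeq)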

Page 16: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Spark Streaming

Page 17: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Cassandra

Page 18: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Spark on Cassandra
• Server-side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
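For example, the connector can push a where clause down to Cassandra rather than filtering in Spark. A small sketch (not from the slides), against the test.kv table used later in the talk:

import com.datastax.spark.connector._

val filtered = sc.cassandraTable("test", "kv")
  .select("key", "value")        // only these columns are fetched
  .where("key = ?", "chris")     // executed server-side as part of the CQL query

filtered.collect().foreach(println)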

Page 19: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs + Spark DStreams

Page 20: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Analytics Workload Isolation

Page 21: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Deployment
• Spark worker in each of the Cassandra nodes
• Partitions made up of LOCAL Cassandra data

(Diagram: four nodes, each running a Spark worker (S) alongside Cassandra (C))

Page 22: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Example Time

Page 23: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

It is on GitHub

"org.apache.spark" %% "spark-core" % sparkVersion"org.apache.spark" %% "spark-streaming" % sparkVersion"org.apache.spark" %% "spark-sql" % sparkVersion"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion"com.datastax.spark" % "spark-cassandra-connector_2.10" % connectorVersion

Page 24: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Page 25: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Boilerplate

import com.datastax.spark.connector.rdd._
import org.apache.spark._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._

object BasicCassandraInteraction extends App {
  val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext("local[4]", "AppName", conf)

  // cool stuff
}

(Callouts on the slide: "127.0.0.1" is the Cassandra host; "local[4]" is the Spark master, e.g. spark://host:port)

Page 26: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Executing code against the driver

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
  session.execute("CREATE TABLE IF NOT EXISTS test.kv(key text PRIMARY KEY, value int)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('chris', 10)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('dan', 1)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('charlieS', 2)")
}

Page 27: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Reading data from Cassandra

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.kv(key text PRIMARY KEY, value int)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('chris', 10)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('dan', 1)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('charlieS', 2)")
}

val rdd: CassandraRDD[CassandraRow] = sc.cassandraTable("test", "kv")
println(rdd.count())
println(rdd.first())
println(rdd.max()(new Ordering[CassandraRow] {
  override def compare(x: CassandraRow, y: CassandraRow): Int =
    x.getInt("value").compare(y.getInt("value"))
}))

Page 28: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Word Count + Save to Cassandra

val textFile: RDD[String] = sc.textFile("Spark-Readme.md")
val words: RDD[String] = textFile.flatMap(line => line.split("\\s+"))
val wordAndCount: RDD[(String, Int)] = words.map((_, 1))
val wordCounts: RDD[(String, Int)] = wordAndCount.reduceByKey(_ + _)
println(wordCounts.first())
wordCounts.saveToCassandra("test", "words", SomeColumns("word", "count"))

Page 29: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Migrating from an RDBMS

create table store(
  store_name varchar(32) primary key,
  location varchar(32),
  store_type varchar(10));

create table staff(
  name varchar(32) primary key,
  favourite_colour varchar(32),
  job_title varchar(32));

create table customer_events(
  id MEDIUMINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer varchar(12),
  time timestamp,
  event_type varchar(16),
  store varchar(32),
  staff varchar(32),
  foreign key fk_store(store) references store(store_name),
  foreign key fk_staff(staff) references staff(name))

Page 30: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Denormalised table

CREATE TABLE IF NOT EXISTS customer_events(
  customer_id text,
  time timestamp,
  id uuid,
  event_type text,
  store_name text,
  store_type text,
  store_location text,
  staff_name text,
  staff_title text,
  PRIMARY KEY ((customer_id), time, id))

Page 31: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Migration time

val customerEvents = new JdbcRDD(sc, () => {
  DriverManager.getConnection(mysqlJdbcString)
}, "select * from customer_events ce, staff, store where ce.store = store.store_name and ce.staff = staff.name " +
   "and ce.id >= ? and ce.id <= ?", 0, 1000, 6,
  (r: ResultSet) => {
    (r.getString("customer"), r.getTimestamp("time"), UUID.randomUUID(), r.getString("event_type"),
      r.getString("store_name"), r.getString("location"), r.getString("store_type"),
      r.getString("staff"), r.getString("job_title"))
  })

customerEvents.saveToCassandra("test", "customer_events",
  SomeColumns("customer_id", "time", "id", "event_type", "store_name", "store_type",
    "store_location", "staff_name", "staff_title"))

Page 32: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Issues with denormalisation
• What happens when I need to query the denormalised data a different way?

Page 33: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Store it twice

CREATE TABLE IF NOT EXISTS customer_events(
  customer_id text, time timestamp, id uuid, event_type text,
  store_name text, store_type text, store_location text,
  staff_name text, staff_title text,
  PRIMARY KEY ((customer_id), time, id))

CREATE TABLE IF NOT EXISTS customer_events_by_staff(
  customer_id text, time timestamp, id uuid, event_type text,
  store_name text, store_type text, store_location text,
  staff_name text, staff_title text,
  PRIMARY KEY ((staff_name), time, id))

Page 34: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

My reaction a year ago

Page 35: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Too simple

val events_by_customer = sc.cassandraTable("test", "customer_events")
events_by_customer.saveToCassandra("test", "customer_events_by_staff",
  SomeColumns("customer_id", "time", "id", "event_type", "staff_name", "staff_title",
    "store_location", "store_name", "store_type"))

Page 36: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Aggregations with Spark SQL

(Diagram: a table laid out by partition key and clustering columns)

Page 37: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Now now…

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("test")

val rdd: SchemaRDD = cc.sql("SELECT store_name, event_type, count(store_name) from customer_events GROUP BY store_name, event_type")

rdd.collect().foreach(println)

[SportsApp,WATCH_STREAM,1]
[SportsApp,LOGOUT,1]
[SportsApp,LOGIN,1]
[ChrisBatey.com,WATCH_MOVIE,1]
[ChrisBatey.com,LOGOUT,1]
[ChrisBatey.com,BUY_MOVIE,1]
[SportsApp,WATCH_MOVIE,2]

Page 38: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Lambda architecture

http://lambda-architecture.net/

Page 39: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Spark Streaming

Page 40: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Network word count

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count(word text PRIMARY KEY, number int)")
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count_raw(time timeuuid PRIMARY KEY, raw text)")
}

val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
lines.map((UUIDs.timeBased(), _)).saveToCassandra("test", "network_word_count_raw")

val words = lines.flatMap(_.split("\\s+"))
val countOfOne = words.map((_, 1))
val reduced = countOfOne.reduceByKey(_ + _)
reduced.saveToCassandra("test", "network_word_count")
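The snippet only wires up the pipeline: the job actually runs once ssc.start() (and typically ssc.awaitTermination()) is called, and locally the socket can be fed by running nc -lk 9999 in another terminal and typing words; each 5-second batch is then counted and written to the two tables.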

Page 41: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Kafka
• Partitioned pub-sub system
• Very high throughput
• Very scalable
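As an illustrative sketch only (not from the talk; the topic name and broker address are assumptions), publishing a JSON event to a Kafka topic that a stream like the one on the next slides could consume:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// topic name and key are made up for the example
producer.send(new ProducerRecord[String, String]("customer-events", "joe", """{"customer_id":"joe","event_type":"BUY"}"""))
producer.close()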

Page 42: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Stream processing customer events

val joeBuy = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
val joeBuy2 = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
val joeSell = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "SELL"))
val chrisBuy = write(CustomerEvent("chris", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS streaming.customer_events_by_type ( nameAndType text primary key, number int)")
  session.execute("CREATE TABLE IF NOT EXISTS streaming.customer_events ( " +
    "customer_id text, " +
    "staff_id text, " +
    "store_type text, " +
    "group text static, " +
    "content text, " +
    "time timeuuid, " +
    "event_type text, " +
    "PRIMARY KEY ((customer_id), time) )")
}
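The CustomerEvent case class itself isn't shown on the slides; the real one lives in the GitHub repo. A hedged guess at its shape, reconstructed only from the fields used above and the table definition:

import java.util.UUID
import com.datastax.driver.core.utils.UUIDs

// Hypothetical reconstruction, not the author's actual class
case class CustomerEvent(customer_id: String,
                         staff_id: String,
                         store_type: String,
                         group: String,
                         content: String,
                         event_type: String,
                         time: UUID = UUIDs.timeBased())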

Page 43: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Save + Process

val rawEvents: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

val events: DStream[CustomerEvent] = rawEvents.map({ case (k, v) =>
  parse(v).extract[CustomerEvent]
})
events.saveToCassandra("streaming", "customer_events")

val eventsByCustomerAndType = events
  .map(event => (s"${event.customer_id}-${event.event_type}", 1))
  .reduceByKey(_ + _)
eventsByCustomerAndType.saveToCassandra("streaming", "customer_events_by_type")

Page 44: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Summary
• Cassandra is an operational database
• Spark gives us the flexibility to do slower things
- Schema migrations
- Ad-hoc queries
- Report generation
• Spark Streaming + Cassandra allow us to build online analytical platforms

Page 45: Reading Cassandra Meetup Feb 2015: Apache Spark

@chbatey

Thanks for listening
• Follow me on twitter @chbatey
• Cassandra + fault tolerance posts aplenty: http://christopher-batey.blogspot.co.uk/
• GitHub for all examples: https://github.com/chbatey/spark-sandbox
• Cassandra resources: http://planetcassandra.org/