feeding cassandra with spark-streaming and kafka
TRANSCRIPT
Feeding Cassandra with Spark Streaming & Kafka
Cary Bourgeois, Solutions Engineer, DataStax, Central Region
Who Am I
• DataStax < 2 years
• Not a “developer”
• Legacy BI/Database
  • Business Objects
  • SAP
• Demo development
  • R
  • Java (if I have to)
  • Scala (someday)
Cassandra Summit 2015 September 22-24, Santa Clara Convention Center
7,000 Attendees
Last Week - Mission Impossible? A stretch, but possible.
Sunday Afternoon - I’m getting my A$$ kicked
Monday Afternoon - Arghhhhh!
Monday Night - I got this!
Capture Raw Data
Analyze & Summarize
Why Mess with Success?
• Spark 1.3+
  • New/improved Kafka support
  • DataFrames
• DataStax Enterprise 4.8
  • Spark 1.4 support
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Fast: A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.
Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Distributed by Design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
• Producers • Consumers • Persistence • Topics • Partitions • Replication
http://kafka.apache.org/documentation.html
• Create a Kafka topic:
  bin/kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic stream_ts
• List all topics:
  bin/kafka-topics.sh --zookeeper localhost:2181 --list
• Monitor a topic:
  bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic stream_ts --from-beginning
Confidential
Kafka and the Producer
The Producer App
• Lots of options
• I chose:
  • Scala (not steep enough)
  • Akka
• Producing this message:
Edge 1;1;401843;2015-11-04 06:23:49.001;64.44286233060423;82.79653847181152
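The producer code itself isn't shown on the slides, but a minimal sketch of an Akka-scheduled Kafka producer emitting messages in this format might look like the following. The topic name `stream_ts` comes from the earlier slide; the broker address, tick interval, and generated field values are illustrative assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import akka.actor.ActorSystem
import scala.concurrent.duration._

object ProducerSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed local broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)

  val system = ActorSystem("producer")
  import system.dispatcher // execution context for the scheduler

  // One reading per second: edge;sensor;epoch_hr;timestamp;depth;value
  system.scheduler.schedule(0.seconds, 1.second) {
    val nowMs = System.currentTimeMillis()
    val msg = s"Edge 1;1;${nowMs / 3600000};${new java.sql.Timestamp(nowMs)};" +
              s"${math.random * 100};${math.random * 100}"
    producer.send(new ProducerRecord[String, String]("stream_ts", msg))
  }
}
```

Note that the third field (milliseconds since the epoch divided by 3,600,000) buckets readings into hour-wide values, which is what makes `epoch_hr` usable as part of a partition key on the Cassandra side.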
Destination - Cassandra Tables
CREATE TABLE demo.data (
  edge_id text,
  sensor text,
  epoch_hr text,
  ts timestamp,
  depth double,
  value double,
  PRIMARY KEY ((edge_id, sensor, epoch_hr), ts)
);

CREATE TABLE demo.last (
  edge_id text,
  sensor text,
  ts timestamp,
  depth double,
  value double,
  PRIMARY KEY ((edge_id, sensor))
);

CREATE TABLE demo.count (
  pk int,
  ts timestamp,
  count bigint,
  count_ma double,
  PRIMARY KEY (pk, ts)
);
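A hedged sketch of how one raw message maps onto `demo.data`: the case class and helper below are my own names, chosen so the fields mirror the columns (the DataStax Spark Cassandra connector translates camelCase fields such as `edgeId` to snake_case columns such as `edge_id` when saving).

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

// Field names mirror the demo.data columns
case class SensorReading(edgeId: String, sensor: String, epochHr: String,
                         ts: Timestamp, depth: Double, value: Double)

// Parses e.g. "Edge 1;1;401843;2015-11-04 06:23:49.001;64.44;82.79"
def parse(line: String): SensorReading = {
  val f = line.split(";")
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
  SensorReading(f(0), f(1), f(2),
                new Timestamp(fmt.parse(f(3)).getTime),
                f(4).toDouble, f(5).toDouble)
}
```

`demo.last` takes the same fields minus `epoch_hr`; because its primary key is just `(edge_id, sensor)`, every write is an upsert, so the table always holds the most recent reading per sensor.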
DSE Analytics => Spark
• No ETL
• Spark 1.4.1 certification
• Simplified map and reduce
• Very developer-friendly
• Spark SQL
• Spark Streaming
• Machine Learning
• DSE Analytics and Search integration
• Cassandra benefits (scaling, availability)
“I want to do processing on data before it hits Cassandra.”
“I need my sums, avgs, group bys, etc.”
“I want to run real-time analytics on my Cassandra data.”
Processing the Stream
• Simple Scala job
• Deal with the raw flow
  • Capture the raw data
  • Capture the latest sensor reading
• Summarize and analyze
  • Windowing the stream
  • Count records every x seconds
  • Calculate a moving average of every x seconds over a number of periods
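The streaming job is not reproduced on the slides; a sketch of the shape it likely takes, using the Spark 1.3+ direct Kafka stream and the DataStax Spark Cassandra connector, is below. The broker address, window lengths, and checkpoint directory are illustrative assumptions; the actual code is in the GitHub repo linked at the end.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

def run(sc: SparkContext): Unit = {
  val ssc = new StreamingContext(sc, Seconds(1))
  ssc.checkpoint("/tmp/stream_ts_ckpt") // required by the windowed count below

  // Receiver-less "direct" stream, new in Spark 1.3
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("stream_ts"))

  // Split each "edge;sensor;epoch_hr;ts;depth;value" message into a row
  val rows = stream.map { case (_, line) =>
    val f = line.split(";")
    val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
    (f(0), f(1), f(2), new java.sql.Timestamp(fmt.parse(f(3)).getTime),
     f(4).toDouble, f(5).toDouble)
  }

  // 1. Capture the raw data
  rows.saveToCassandra("demo", "data",
    SomeColumns("edge_id", "sensor", "epoch_hr", "ts", "depth", "value"))

  // 2. Capture the latest reading; demo.last is keyed on (edge_id, sensor), so this upserts
  rows.map { case (e, s, _, ts, d, v) => (e, s, ts, d, v) }
    .saveToCassandra("demo", "last", SomeColumns("edge_id", "sensor", "ts", "depth", "value"))

  // 3. Count records in 5-second windows and write to demo.count
  stream.countByWindow(Seconds(5), Seconds(5))
    .map(c => (1, new java.sql.Timestamp(System.currentTimeMillis()), c))
    .saveToCassandra("demo", "count", SomeColumns("pk", "ts", "count"))
  // The count_ma column (a moving average over several of these windows) can be
  // maintained analogously, e.g. a 25-second window count divided by 5 periods.

  ssc.start()
  ssc.awaitTermination()
}
```

The direct stream is the Spark 1.3 improvement the earlier slide refers to: it tracks Kafka offsets itself instead of using receivers, and since the Cassandra writes are idempotent upserts, replaying a batch after a failure does not duplicate data.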
Full Demo
Next Steps
• SparkR
• MLlib workflows
• Notebooks
  • Spark
  • Jupyter
If you would like the code:
https://github.com/CaryBourgeois/KafkaSparkCassandraDemo