feeding cassandra with spark-streaming and kafka
TRANSCRIPT
Feeding Cassandra with Spark Streaming & Kafka
Cary Bourgeois, Solutions Engineer, DataStax, Central Region
Who Am I
• DataStax < 2 years
• Not a “developer”
• Legacy BI/Database
  • Business Objects
  • SAP
• Demo development
  • R
  • Java (if I have to)
  • Scala (someday)
Cassandra Summit 2015 September 22-24, Santa Clara Convention Center
7,000 Attendees
Last Week - Mission Impossible? A stretch, but possible.
Sunday Afternoon - I’m getting my A$$ kicked
Monday Afternoon - Arghhhhh!
Monday Night - I got this!
Capture Raw Data
Analyze & Summarize
Why Mess with Success?
• Spark 1.3+
  • New/improved Kafka support
  • DataFrames
• DataStax Enterprise 4.8
  • Spark 1.4 support
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Fast: A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.
Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Distributed by Design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
• Producers • Consumers • Persistence • Topics • Partitions • Replication
http://kafka.apache.org/documentation.html
• Create a Kafka topic:
  bin/kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic stream_ts
• List all topics:
  bin/kafka-topics.sh --zookeeper localhost:2181 --list
• Monitor a topic:
  bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic stream_ts --from-beginning
Confidential
Kafka and the Producer
The Producer App
• Lots of options
• I chose:
  • Scala (not steep enough)
  • Akka
• Producing this message:
Edge 1;1;401843;2015-11-04 06:23:49.001;64.44286233060423;82.79653847181152
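The producer code itself isn't shown on the slides, but a minimal sketch of an Akka-scheduled Kafka producer emitting messages in this format might look like the following. The topic name `stream_ts` comes from the earlier slide; the broker address, tick interval, and generated field values are illustrative assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import akka.actor.ActorSystem
import scala.concurrent.duration._

object ProducerSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed local broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)

  val system = ActorSystem("producer")
  import system.dispatcher // execution context for the scheduler

  // One reading per second: edge;sensor;epoch_hr;timestamp;depth;value
  system.scheduler.schedule(0.seconds, 1.second) {
    val nowMs = System.currentTimeMillis()
    val msg = s"Edge 1;1;${nowMs / 3600000};${new java.sql.Timestamp(nowMs)};" +
              s"${math.random * 100};${math.random * 100}"
    producer.send(new ProducerRecord[String, String]("stream_ts", msg))
  }
}
```

Note that the third field (milliseconds since the epoch divided by 3,600,000) buckets readings into hour-wide values, which is what makes `epoch_hr` usable as part of a partition key on the Cassandra side.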
Destination - Cassandra Tables
CREATE TABLE demo.data (
  edge_id text,
  sensor text,
  epoch_hr text,
  ts timestamp,
  depth double,
  value double,
  PRIMARY KEY ((edge_id, sensor, epoch_hr), ts)
);

CREATE TABLE demo.last (
  edge_id text,
  sensor text,
  ts timestamp,
  depth double,
  value double,
  PRIMARY KEY ((edge_id, sensor))
);

CREATE TABLE demo.count (
  pk int,
  ts timestamp,
  count bigint,
  count_ma double,
  PRIMARY KEY (pk, ts)
);
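A hedged sketch of how one raw message maps onto `demo.data`: the case class and helper below are my own names, chosen so the fields mirror the columns (the DataStax Spark Cassandra connector translates camelCase fields such as `edgeId` to snake_case columns such as `edge_id` when saving).

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

// Field names mirror the demo.data columns
case class SensorReading(edgeId: String, sensor: String, epochHr: String,
                         ts: Timestamp, depth: Double, value: Double)

// Parses e.g. "Edge 1;1;401843;2015-11-04 06:23:49.001;64.44;82.79"
def parse(line: String): SensorReading = {
  val f = line.split(";")
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
  SensorReading(f(0), f(1), f(2),
                new Timestamp(fmt.parse(f(3)).getTime),
                f(4).toDouble, f(5).toDouble)
}
```

`demo.last` takes the same fields minus `epoch_hr`; because its primary key is just `(edge_id, sensor)`, every write is an upsert, so the table always holds the most recent reading per sensor.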
DSE Analytics => Spark
• No ETL
• Spark 1.4.1 certification
• Simplified map and reduce
• Very developer-friendly
• Spark SQL
• Spark Streaming
• Machine Learning
• DSE Analytics and Search integration
• Cassandra benefits (scaling, availability)
“I want to do processing on data before it hits Cassandra.”
“I need my sums, avgs, group bys, etc.”
“I want to run real-time analytics on my Cassandra data.”
Processing the Stream
• Simple Scala job
• Deal with the raw flow
  • Capture the raw data
  • Capture the latest sensor reading
• Summarize and analyze
  • Windowing the stream
  • Count records every x seconds
  • Calculate a moving average of every x seconds over a number of periods
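The streaming job is not reproduced on the slides; a sketch of the shape it likely takes, using the Spark 1.3+ direct Kafka stream and the DataStax Spark Cassandra connector, is below. The broker address, window lengths, and checkpoint directory are illustrative assumptions; the actual code is in the GitHub repo linked at the end.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

def run(sc: SparkContext): Unit = {
  val ssc = new StreamingContext(sc, Seconds(1))
  ssc.checkpoint("/tmp/stream_ts_ckpt") // required by the windowed count below

  // Receiver-less "direct" stream, new in Spark 1.3
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("stream_ts"))

  // Split each "edge;sensor;epoch_hr;ts;depth;value" message into a row
  val rows = stream.map { case (_, line) =>
    val f = line.split(";")
    val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
    (f(0), f(1), f(2), new java.sql.Timestamp(fmt.parse(f(3)).getTime),
     f(4).toDouble, f(5).toDouble)
  }

  // 1. Capture the raw data
  rows.saveToCassandra("demo", "data",
    SomeColumns("edge_id", "sensor", "epoch_hr", "ts", "depth", "value"))

  // 2. Capture the latest reading; demo.last is keyed on (edge_id, sensor), so this upserts
  rows.map { case (e, s, _, ts, d, v) => (e, s, ts, d, v) }
    .saveToCassandra("demo", "last", SomeColumns("edge_id", "sensor", "ts", "depth", "value"))

  // 3. Count records in 5-second windows and write to demo.count
  stream.countByWindow(Seconds(5), Seconds(5))
    .map(c => (1, new java.sql.Timestamp(System.currentTimeMillis()), c))
    .saveToCassandra("demo", "count", SomeColumns("pk", "ts", "count"))
  // The count_ma column (a moving average over several of these windows) can be
  // maintained analogously, e.g. a 25-second window count divided by 5 periods.

  ssc.start()
  ssc.awaitTermination()
}
```

The direct stream is the Spark 1.3 improvement the earlier slide refers to: it tracks Kafka offsets itself instead of using receivers, and since the Cassandra writes are idempotent upserts, replaying a batch after a failure does not duplicate data.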
Full Demo
Next Steps
• SparkR
• MLlib workflows
• Notebooks
  • Spark
  • Jupyter
If you would like the code:
https://github.com/CaryBourgeois/KafkaSparkCassandraDemo