Spark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib
The stack for distributed, massively scalable, (near) real-time data processing and machine learning
Presented by
Phaneendra Chiruvella
http://twitter.com/pcx66
Hyderabad Spark Group & Zemoso Technologies
Agenda
● Brief intro to Spark Core
● Introduction to Spark Streaming
● What is the world talking about? A demo of Spark Streaming with Twitter
● Introduction to Spark MLlib
● Let’s see what movies you might like: A demo of Spark MLlib by building a Movie Recommendation Engine
Spark: Lightning-fast cluster computing
● Data processing engine
● Distributed
● Massively scalable: the largest known cluster has 8,000 machines and processes petabytes of data
● Programmable in Scala, Java, Python and R
● Interactive shell
● Both batch and stream processing
● Stable and robust: used in production at many companies
● Known to work well with other “Big data” tools like Kafka, Cassandra, HDFS, HBase, etc.
Image source: http://spark.apache.org/docs/latest/cluster-overview.html
Spark: How does it work?
● Every application has its own SparkContext
● Available cluster managers: Spark Standalone, YARN, Mesos
Image source: http://spark.apache.org/docs/latest/cluster-overview.html
Spark: Resilient Distributed Datasets
The RDD is the fundamental abstraction of Spark, providing a rich, fault-tolerant layer over a cluster of machines
[Diagram: SparkContext, Executors, RDD]
Spark Core: Demo
● Creating RDDs
● Transformations
● Actions
● Cache
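The four demo topics above can be illustrated without a cluster. The following is a minimal single-machine sketch in plain Python, not the Spark API: `MiniRDD` and its methods are hypothetical names chosen to mirror the RDD vocabulary. It shows that transformations only record work, actions trigger it, and a cached dataset is computed once.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument callable returning an iterator
        self._use_cache = False
        self._cached = None

    @classmethod
    def parallelize(cls, items):
        """Create a dataset from an in-memory collection."""
        data = list(items)
        return cls(lambda: iter(data))

    def _iterate(self):
        # Serve from the cache if it is filled; fill it on first use if requested.
        if self._cached is not None:
            return iter(self._cached)
        if self._use_cache:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    # Transformations: lazy, they only describe the computation.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, predicate):
        return MiniRDD(lambda: (x for x in self._iterate() if predicate(x)))

    def cache(self):
        self._use_cache = True
        return self

    # Actions: eager, they run the recorded computation.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []  # records every invocation of square(), to make laziness visible


def square(x):
    calls.append(x)
    return x * x


numbers = MiniRDD.parallelize(range(1, 6))
squares = numbers.map(square).cache()
evens = squares.filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: no action has run yet

result = evens.collect()           # first action: computes and caches squares
total = squares.count()            # second action: served from the cache
```

After both actions, `square` has been called once per element rather than twice, which is the effect `cache` is meant to have. Real RDDs are additionally partitioned across executors and can be recomputed from their lineage after a failure; this sketch models neither.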
Spark Streaming: Batch processing is not enough!
● Extension to the Core API
● Micro-batches processed in near real time
● Latency reduced to the order of seconds
Spark Streaming: How does it work?
● DStreams: a continuous sequence of RDDs, one per batch interval
● Batch Interval, Input DStreams and Receivers
● Some Input Sources: Sockets, File systems, Kafka, Twitter
Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
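To make the batch-interval idea concrete, here is a small plain-Python sketch (not the Spark Streaming API) that chops a list of timestamped events into micro-batches and applies the same per-batch computation to each. The function names, the sample events and the one-second interval are all assumptions made for illustration.

```python
from collections import Counter


def to_micro_batches(events, batch_interval, start=0.0):
    """Group (timestamp_seconds, value) pairs into consecutive batches.

    Intervals without events yield an empty batch, so the result has one
    entry per interval, like one small dataset per batch interval.
    """
    if batch_interval <= 0:
        raise ValueError("batch_interval must be positive")
    if not events:
        return []
    last_index = int((max(t for t, _ in events) - start) // batch_interval)
    batches = [[] for _ in range(last_index + 1)]
    for timestamp, value in events:
        if timestamp < start:
            continue  # ignore events that arrived before the stream started
        batches[int((timestamp - start) // batch_interval)].append(value)
    return batches


def count_words(batch):
    """The per-batch computation; the same code runs on every micro-batch."""
    return dict(Counter(batch))


# Hypothetical input: words arriving at the given times (in seconds).
events = [(0.2, "spark"), (0.9, "kafka"), (1.1, "spark"), (3.5, "mllib")]

batches = to_micro_batches(events, batch_interval=1.0)
per_batch_counts = [count_words(batch) for batch in batches]
```

With these inputs the third interval contains no events, so its batch is empty. In a real deployment a receiver collects the data as it arrives instead of reading it from a list.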
Spark Streaming: How does it work?
● Windowed operations
● DStream Transformations are translated to RDD Transformations
● Direct access to RDDs underneath
Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
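A windowed operation combines the most recent few batches each time the window slides. The sketch below models this in plain Python, with the window length and slide interval expressed in numbers of batches; `windowed_counts` and the sample data are hypothetical and only meant to show the mechanics.

```python
from collections import Counter


def windowed_counts(batches, window_length, slide_interval):
    """Count values over a sliding window of batches.

    window_length:  how many of the most recent batches each window covers
    slide_interval: how many batches pass between two window evaluations
    """
    if window_length <= 0 or slide_interval <= 0:
        raise ValueError("window_length and slide_interval must be positive")
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        window = batches[max(0, end - window_length):end]
        counts = Counter()
        for batch in window:
            counts.update(batch)
        results.append(dict(counts))
    return results


# Four micro-batches of hashtags (made-up data).
batches = [["a", "b"], ["a"], ["c"], ["a", "c"]]

# Every 2 batches, count over the last 3 batches.
counts_per_window = windowed_counts(batches, window_length=3, slide_interval=2)
```

The first window only sees two batches because the stream has just started; the second covers batches two to four.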
Spark Streaming: Demo
● What is the world talking about? A Twitter stream analysis
Spark MLlib: Analytics alone is not enough!
● Practical, scalable ML library with implementations of several common algorithms, and more being added
● An alternative high-level API, spark.ml, is built on spark.sql.DataFrame. It is out of scope for this talk.
Spark MLlib: Demo
● Let’s see what movies you might like: A demo of Spark MLlib by building a Movie Recommendation Engine
Spark: Streaming and MLlib, a match made in heaven!
● MLlib provides algorithms that learn from streaming data while simultaneously making predictions on it
● Also, a large set of algorithms that can be trained offline and then applied to streaming data
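The first pattern above, learning from a stream while also predicting on it, can be sketched in plain Python. This is not MLlib's API: `OnlineLinearModel`, `run_stream`, the learning rate and the synthetic data (y = 2x + 1) are assumptions for illustration. Each micro-batch is first scored with the current model and then used to update it with stochastic gradient descent.

```python
class OnlineLinearModel:
    """One-feature linear model trained incrementally (illustrative only)."""

    def __init__(self, learning_rate=0.05):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x + self.bias

    def train_on(self, batch):
        # One SGD step per (x, y) example, using the squared-error gradient.
        for x, y in batch:
            error = self.predict(x) - y
            self.weight -= self.learning_rate * error * x
            self.bias -= self.learning_rate * error


def run_stream(model, batches):
    """For each micro-batch: predict first, then learn from the same batch.

    Returns the mean absolute prediction error observed on each batch.
    """
    errors = []
    for batch in batches:
        predictions = [model.predict(x) for x, _ in batch]
        total = sum(abs(p - y) for p, (_, y) in zip(predictions, batch))
        errors.append(total / len(batch))
        model.train_on(batch)
    return errors


# Synthetic stream: 200 micro-batches of points on the line y = 2x + 1.
batches = [[(float(x), 2.0 * x + 1.0) for x in range(4)] for _ in range(200)]

model = OnlineLinearModel()
errors = run_stream(model, batches)
```

The per-batch error shrinks as more batches arrive, since the model that scores batch n has already learned from batches 1 to n-1. The second pattern on the slide corresponds to calling only `predict` on the stream with a model that was trained beforehand.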
Spark: What next?
● Spark SQL - A SQL-like layer over RDDs
● spark.ml
● Spark GraphX - A graph-processing abstraction over RDDs
● Apache Storm and Apache Flink - Modern streaming-first systems
Q&A
Thank you!
Slide deck will be made available at: http://blog.zemosolabs.com/
Spark Docs are a great place to get started: http://spark.apache.org/docs/latest/programming-guide.html
Acknowledgements:
Code demos are from Databricks Training
Memes generated from ImgFlip.com