Spark Streaming and MLlib - Hyderabad Spark Group

Spark Streaming and MLlib The stack for distributed, massively scalable, (near) real-time data processing and machine learning present Phaneendra Chiruvella http://twitter.com/pcx66 Hyderabad Spark Group & Zemoso Technologies

Upload: phaneendra-chiruvella

Post on 14-Apr-2017





TRANSCRIPT

Page 1: Spark Streaming and MLlib - Hyderabad Spark Group

Spark Streaming and MLlib

The stack for distributed, massively scalable, (near) real-time data processing and machine learning

present

Phaneendra Chiruvella

http://twitter.com/pcx66

Hyderabad Spark Group & Zemoso Technologies

Page 2: Spark Streaming and MLlib - Hyderabad Spark Group

Agenda

● Brief intro to Spark Core

● Introduction to Spark Streaming

● What is the world talking about?: A demo of Spark Streaming with Twitter

● Introduction to Spark MLlib

● Let’s see what movies you might like: A demo of Spark MLlib by building a Movie Recommendation Engine

Page 3: Spark Streaming and MLlib - Hyderabad Spark Group

Spark: Lightning-fast cluster computing

● Data processing engine

● Distributed

● Massively scalable: the largest known cluster has 8,000 machines and processes petabytes of data

● Programmable in Scala, Java, Python and R

● Interactive shell

● Both batch and stream processing

● Stable and robust: used in production at many companies

● Known to work well with other “Big data” tools like Kafka, Cassandra, HDFS, HBase, etc.

Image source: http://spark.apache.org/docs/latest/cluster-overview.html

Page 4: Spark Streaming and MLlib - Hyderabad Spark Group

Spark: How it works?

● Every application has its own SparkContext

● Available cluster managers: Spark Standalone, YARN, Mesos

Image source: http://spark.apache.org/docs/latest/cluster-overview.html

Page 5: Spark Streaming and MLlib - Hyderabad Spark Group

Spark: Resilient Distributed Datasets

RDD is the fundamental abstraction of Spark, providing a rich, fault-tolerant layer over a cluster of machines

[Diagram labels: SparkContext, Executors, RDD]

Page 6: Spark Streaming and MLlib - Hyderabad Spark Group

Spark Core: Demo

● Creating RDDs

● Transformations

● Actions

● Cache
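The demo code is not part of this transcript. As a rough illustration of the four demo topics, here is a minimal single-machine sketch in plain Python. `ToyRDD` and its internals are hypothetical and are not Spark's API; only the method names mirror real RDD operations (`parallelize`, `map`, `filter`, `cache`, `collect`, `count`). It shows that transformations are lazy, that actions trigger the work, and that caching avoids recomputation.

```python
class ToyRDD:
    """Single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function producing the elements
        self._use_cache = False
        self._cached = None        # filled by the first action after cache()

    @classmethod
    def parallelize(cls, data):
        items = list(data)
        return cls(lambda: iter(items))

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        self._use_cache = True
        return self

    # Actions: force the chain of transformations to run.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())

    def _iterate(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()


calls = []  # records every invocation of square()

def square(x):
    calls.append(x)
    return x * x

numbers = ToyRDD.parallelize(range(1, 6))
squares = numbers.map(square).cache()
evens = squares.filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)   # 0: nothing has been computed yet
evens_result = evens.collect()     # [4, 16]
squares_count = squares.count()    # 5, served from the cache
calls_after_actions = len(calls)   # 5: square() ran once per element

print(evens_result, squares_count, calls_before_action, calls_after_actions)
```

Without the `cache()` call, the second action would run `square` over all five elements again, which is the cost caching is meant to avoid.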

Page 7: Spark Streaming and MLlib - Hyderabad Spark Group

Spark Streaming: Batch processing is not enough!

● Extension to the Core API

● Micro-batches processed in near real time

● Latency reduced to the order of seconds

Page 8: Spark Streaming and MLlib - Hyderabad Spark Group

Spark Streaming: How it works?

● DStreams - Just a chain of RDDs

● Batch Interval, Input DStreams and Receivers

● Some Input Sources: Sockets, File systems, Kafka, Twitter

Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
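To make the micro-batch idea concrete, the sketch below groups timestamped events into consecutive batches, one per batch interval, which is roughly how a DStream can be pictured as a chain of RDDs. It is plain Python, not Spark code; `to_micro_batches` and the sample events are hypothetical.

```python
from collections import defaultdict

def to_micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive batches.

    Batch k holds the events with
    k * batch_interval <= timestamp < (k + 1) * batch_interval.
    In this toy model an interval without events yields an empty batch.
    """
    if not events:
        return []
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[int(timestamp // batch_interval)].append(value)
    last = max(buckets)
    return [buckets.get(k, []) for k in range(last + 1)]


# Hypothetical events arriving over about four seconds.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = to_micro_batches(events, batch_interval=1.0)

# Each batch is then processed like an ordinary small dataset.
batch_sizes = [len(batch) for batch in batches]

print(batches)      # [['a', 'b'], ['c'], [], ['d']]
print(batch_sizes)  # [2, 1, 0, 1]
```

A smaller batch interval lowers latency but produces more, smaller batches to schedule.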

Page 9: Spark Streaming and MLlib - Hyderabad Spark Group

Spark Streaming: How it works?

● Windowed operations

● DStream Transformations are translated to RDD Transformations

● Direct access to RDDs underneath

Image source: spark.apache.org/docs/latest/streaming-programming-guide.html
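A windowed operation can be pictured as per-batch work merged across the last few batches. The plain-Python sketch below counts words over a sliding window; `windowed_counts` and the sample batches are hypothetical. In Spark Streaming the window length and slide interval must be multiples of the batch interval, which is why they are measured in whole batches here.

```python
from collections import Counter

def windowed_counts(batches, window_length, slide_interval):
    """Count items over sliding windows, both sizes given in batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)   # per-batch counts merged across the window
        results.append(dict(counts))
    return results


# Hypothetical stream of four micro-batches.
batches = [
    ["spark", "storm"],
    ["spark"],
    ["flink", "spark"],
    ["storm"],
]

# Window of 3 batches, re-evaluated every 2 batches.
windows = windowed_counts(batches, window_length=3, slide_interval=2)
for window in windows:
    print(window)
```

The first window only covers two batches because the stream has not yet produced a third.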

Page 10: Spark Streaming and MLlib - Hyderabad Spark Group

Spark Streaming: Demo

● What is the world talking about?: A Twitter stream analysis
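The demo code is not included in this transcript. As an assumption about what such an analysis might compute, the sketch below extracts hashtags from one batch of tweet texts and returns the most frequent ones. It is plain Python rather than Spark code, and `top_hashtags` and the sample tweets are made up for illustration.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")

def top_hashtags(tweets, k):
    """Return the k most frequent hashtags (lower-cased) with their counts."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(k)


# One hypothetical micro-batch of tweet texts.
tweets = [
    "Learning #Spark at the meetup",
    "#spark streaming is fun #BigData",
    "No tags here",
    "#bigdata and #Spark again",
]

trending = top_hashtags(tweets, k=2)
print(trending)  # [('spark', 3), ('bigdata', 2)]
```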

Page 11: Spark Streaming and MLlib - Hyderabad Spark Group

Spark MLlib: Just analytics is not enough!

● Practical, scalable ML library with implementations of several common algorithms, with more being added

● Alternative high-level API, spark.ml, based on spark.sql.DataFrame. Out of scope for this talk.

Page 12: Spark Streaming and MLlib - Hyderabad Spark Group

Spark MLlib: Demo

● Let’s see what movies you might like: A demo of Spark MLlib by building a Movie Recommendation Engine
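The demo code is not in this transcript. To give a feel for how a recommendation engine can work, here is a small matrix-factorization sketch in plain Python. It is an assumption-laden toy: MLlib's collaborative filtering uses alternating least squares (ALS), whereas this sketch swaps in stochastic gradient descent to stay short, and the ratings, function names and hyperparameters are all made up.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rmse(ratings, user_f, item_f):
    """Root-mean-square error of predicted vs. actual ratings."""
    se = sum((r - dot(user_f[u], item_f[i])) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))

def train(ratings, n_users, n_items, rank=2, epochs=500, lr=0.01, reg=0.05, seed=0):
    """Learn user and item factor vectors with SGD (not ALS)."""
    rng = random.Random(seed)
    user_f = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    item_f = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]
    initial_error = rmse(ratings, user_f, item_f)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(user_f[u], item_f[i])
            for k in range(rank):
                pu, qi = user_f[u][k], item_f[i][k]
                user_f[u][k] += lr * (err * qi - reg * pu)
                item_f[i][k] += lr * (err * pu - reg * qi)
    return user_f, item_f, initial_error

def recommend(user, ratings, user_f, item_f, n_items):
    """Rank the movies the user has not rated by predicted rating."""
    seen = {i for u, i, _ in ratings if u == user}
    unseen = [i for i in range(n_items) if i not in seen]
    return sorted(unseen, key=lambda i: dot(user_f[user], item_f[i]), reverse=True)


# Hypothetical (user, movie, rating) triples on a 1-5 scale.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1), (1, 4, 5),
    (2, 2, 5), (2, 3, 4), (2, 0, 1), (2, 4, 2),
]

user_f, item_f, initial_error = train(ratings, n_users=3, n_items=5)
final_error = rmse(ratings, user_f, item_f)
recommendations = recommend(0, ratings, user_f, item_f, n_items=5)

print("RMSE before/after training:", initial_error, final_error)
print("Movies to suggest to user 0, best first:", recommendations)
```

Each user and movie gets a short vector, and a predicted rating is the dot product of the two. Recommending then means ranking a user's unrated movies by that prediction.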

Page 13: Spark Streaming and MLlib - Hyderabad Spark Group

Spark: Streaming and MLlib, a match made in heaven!

● MLlib provides algorithms that can learn from streaming data and simultaneously be applied to that data!

● Also, a large set of algorithms that can learn offline and then be applied to streaming data
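The "learn and apply at the same time" pattern can be sketched as a loop that first predicts on each incoming batch and then updates the model with it. The plain-Python toy below does this with a one-dimensional k-means whose update rule is in the spirit of streaming k-means. `ToyStreamingKMeans`, its decay handling and the sample data are assumptions for illustration, not MLlib's implementation.

```python
class ToyStreamingKMeans:
    """1-D k-means updated batch by batch (hypothetical toy model)."""

    def __init__(self, centers, decay=1.0):
        self.centers = list(centers)
        self.weights = [0.0] * len(self.centers)  # points seen per cluster
        self.decay = decay  # 1.0 keeps all history; smaller values forget faster

    def predict(self, x):
        """Index of the nearest center."""
        return min(range(len(self.centers)), key=lambda j: abs(x - self.centers[j]))

    def update(self, batch):
        """Move each center toward the mean of the batch points assigned to it."""
        sums = [0.0] * len(self.centers)
        counts = [0] * len(self.centers)
        for x in batch:
            j = self.predict(x)
            sums[j] += x
            counts[j] += 1
        for j, m in enumerate(counts):
            old = self.weights[j] * self.decay
            if old + m > 0:
                self.centers[j] = (self.centers[j] * old + sums[j]) / (old + m)
            self.weights[j] = old + m


# Hypothetical stream of three micro-batches with two obvious groups.
stream = [
    [1.0, 1.2, 9.0],
    [0.8, 8.8, 9.2],
    [1.1, 9.1],
]

model = ToyStreamingKMeans(centers=[0.0, 10.0])
predictions = []
for batch in stream:
    predictions.append([model.predict(x) for x in batch])  # apply the current model
    model.update(batch)                                    # then learn from the batch

print(predictions)     # [[0, 0, 1], [0, 1, 1], [0, 1]]
print(model.centers)   # close to [1.025, 9.025]
```

The offline alternative from the second bullet would train the model once on historical data and only call `predict` inside the loop.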

Page 14: Spark Streaming and MLlib - Hyderabad Spark Group

Spark: What next?

● Spark SQL - A SQL-like layer over RDDs

● spark.ml

● Spark GraphX - A graph-processing abstraction over RDDs

● Apache Storm and Apache Flink - Modern streaming-first systems

Q&A

Page 15: Spark Streaming and MLlib - Hyderabad Spark Group

Thank you!

Slide deck will be made available at: http://blog.zemosolabs.com/

Spark Docs are a great place to get started: http://spark.apache.org/docs/latest/programming-guide.html

Acknowledgements:

● Code demos are from Databricks Training

● Memes generated from ImgFlip.com