Post on 22-May-2020

Source: files.meetup.com/19158234/Slides Long Version.pdf

Structured Streams in Spark 2.0

Long Tran

Overview

• RDDs
• RDD / Scala API
• D-Streams (0.7)
• SQL (1.3)
• DataFrames API
• Catalyst
• Tungsten (1.4)
• Structured Streams (finally!) (2.0)
• File API - future APIs
• Exactly Once semantics

https://spark.apache.org/

RDD: Resilient Distributed Dataset

A parallelized, lazily evaluated, directed acyclic graph of computation.
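
To make "lazily evaluated graph of computation" concrete, here is a minimal pure-Python sketch. It is not Spark's API: the class `LazyDataset` and its methods are hypothetical names, and partitioning and parallelism are ignored. Transformations only record lineage; work happens when an action (`collect`) is called, and the result can be recomputed from that lineage.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical): records lineage, computes only on an action."""

    def __init__(self, source: Optional[Iterable[Any]] = None,
                 parent: Optional["LazyDataset"] = None,
                 op: Optional[Callable[[Iterator[Any]], Iterator[Any]]] = None) -> None:
        self._source = source  # set only on the root node
        self._parent = parent  # previous node in the lineage graph
        self._op = op          # how to derive this node's rows from the parent's rows

    def map(self, f: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: adds a node to the graph, runs nothing yet.
        return LazyDataset(parent=self, op=lambda rows: (f(x) for x in rows))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: adds a node to the graph, runs nothing yet.
        return LazyDataset(parent=self, op=lambda rows: (x for x in rows if pred(x)))

    def _compute(self) -> Iterator[Any]:
        # Walk the lineage back to the root and chain the recorded operations.
        if self._parent is None:
            return iter(self._source)
        return self._op(self._parent._compute())

    def collect(self) -> List[Any]:
        # Action: this is the point where the graph is actually evaluated.
        return list(self._compute())


calls: List[int] = []


def traced_double(x: int) -> int:
    calls.append(x)  # record when the function really runs
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # nothing has run yet
result = ds.collect()              # triggers the whole chain
```

Because each node keeps a reference to its parent, calling `collect()` again simply replays the lineage, which is the toy analogue of recomputing lost data instead of replicating it.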

https://dzone.com/refcardz/apache-spark

RDD + Scala Word Count

SCALA

SPARK

Spark Streaming

http://spark.apache.org/docs/latest/streaming-programming-guide.html

DStream API

DStream Word Count
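
The slide's code is not preserved in this transcript. As a stand-in, the following pure-Python sketch (not the DStream API) illustrates the micro-batch idea behind it: the stream is cut into small batches and each batch is counted as an independent dataset. The helper names `micro_batches` and `word_count` are hypothetical, and batching by a fixed number of lines is an assumption made for simplicity; DStreams batch by time interval.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def micro_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group incoming lines into fixed-size batches (stand-in for a batch interval)."""
    batch: List[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def word_count(batch: List[str]) -> Dict[str, int]:
    """Count words within a single batch only; no state is carried across batches."""
    return dict(Counter(word for line in batch for word in line.split()))


incoming = ["spark streaming", "spark sql", "structured streaming"]
per_batch_counts = [word_count(b) for b in micro_batches(incoming, batch_size=2)]
```

Each batch yields its own counts, so "streaming" is reported once per batch rather than as a running total. Keeping totals across batches requires the user to manage state explicitly.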

SQL & DataFrames

API for computing structured data

DataFrames and SQL Word Count

DataFrames

SQL

Catalyst

https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Catalyst

Catalyst

Tungsten

www.slideshare.net/databricks/2015-0616-spark-summit

Memory Management and Binary Processing

Tungsten

Cache-Aware Computation

Spark 2.0 is the ALPHA RELEASE of Structured Streaming

Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

Classic Streaming

Continuous Applications

https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png

Programming Model

Structured Stream Word Count
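
The slide's code is not preserved in this transcript. The pure-Python sketch below (not Spark's API; `RunningWordCount` and `on_trigger` are hypothetical names) models the "stream as a table" idea instead: newly arrived rows are treated as appended to an unbounded input table, and the result table is updated incrementally on each trigger, so the answer always matches a batch query over all input seen so far.

```python
from collections import Counter
from typing import Dict, List


class RunningWordCount:
    """Toy model of a streaming query whose result table is maintained incrementally."""

    def __init__(self) -> None:
        self.result_table: Counter = Counter()

    def on_trigger(self, new_rows: List[str]) -> Dict[str, int]:
        # Only the new rows are processed; earlier input is never re-read.
        for line in new_rows:
            self.result_table.update(line.split())
        # Return the complete, updated result table.
        return dict(self.result_table)


query = RunningWordCount()
after_first = query.on_trigger(["cat dog", "dog dog"])
after_second = query.on_trigger(["owl cat"])

# The same answer computed as a one-shot batch query over all rows seen so far.
batch_answer = dict(Counter(w for line in ["cat dog", "dog dog", "owl cat"] for w in line.split()))
```

Unlike per-batch counting, the running totals are kept by the query itself, which is the sense in which the user does not have to reason about intervals.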

end-to-end exactly once guarantees
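
One way to picture how such a guarantee can hold is the combination of a replayable source (offsets) and an idempotent sink. The sketch below is a pure-Python illustration of that idea under stated assumptions, not Spark's implementation: `IdempotentSink`, `run_batch`, and the in-memory `source_log` are hypothetical, and the "failure" is simulated by re-running a batch.

```python
from typing import Dict, List


class IdempotentSink:
    """Hypothetical sink that remembers committed batch ids and ignores repeats."""

    def __init__(self) -> None:
        self.committed: Dict[int, List[str]] = {}

    def write(self, batch_id: int, rows: List[str]) -> bool:
        if batch_id in self.committed:
            return False  # already written: a retry has no effect
        self.committed[batch_id] = list(rows)
        return True


def run_batch(source: List[str], start: int, end: int,
              batch_id: int, sink: IdempotentSink) -> bool:
    # Replayable source: the same offset range always yields the same rows.
    rows = source[start:end]
    return sink.write(batch_id, rows)


source_log = ["a", "b", "c", "d"]
sink = IdempotentSink()

first_attempt = run_batch(source_log, 0, 2, 0, sink)  # written
retry = run_batch(source_log, 0, 2, 0, sink)          # simulated retry after a failure
next_batch = run_batch(source_log, 2, 4, 1, sink)     # processing continues

delivered = [row for batch_id in sorted(sink.committed) for row in sink.committed[batch_id]]
```

Replaying offsets 0 to 2 after the simulated failure neither loses nor duplicates rows, because the retry reads identical data and the sink rejects the repeated batch id.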

Conclusions

• DataFrame and SQL for streaming
• Catalyst!
• Tungsten!
• Unified API for batch and streaming (+ ML + GraphFrames)
• BIs, DBAs, Data Scientists can now do streaming!
• Exactly once guarantees
• No need to reason about intervals
• Event Time primitives

Future

• Current support for reading file streams only
• Kafka Integration (2.1)
• Public API for sources and sinks
• Watermarks
• ML Integration - continuously updated models

@LooooongTran

Sources

• https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
• https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
• https://www.youtube.com/watch?v=rl8dIzTpxrI
• https://www.youtube.com/watch?v=fn3WeMZZcCk
• https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
• https://www.oreilly.com/learning/apache-spark-2-0--introduction-to-structured-streaming
• https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
• https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
• https://www.youtube.com/watch?v=1a4pgYzeFwE
• https://www.youtube.com/watch?v=5ajs8EIPWGI
• http://www.kdnuggets.com/2016/05/spark-tungsten-burns-brighter.html
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-whole-stage-codegen.html