Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)

Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM

Upload: alexey-zinoviev

Post on 13-Apr-2017


Category: Engineering



TRANSCRIPT

Page 1: Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)

Spark 2

Alexey Zinovyev, Java/BigData Trainer in EPAM

Page 2

About

With IT since 2007

With Java since 2009

With Hadoop since 2012

With EPAM since 2015

Page 3

3Spark 2 from Zinoviev Alexey

Contacts

E-mail : [email protected]

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs

Page 4

Sprk Dvlprs! Let’s start!

Page 5

< SPARK 2.0

Page 6

Modern Java in 2016
Big Data in 2014

Page 7

Big Data in 2017

Page 8

Machine Learning EVERYWHERE

Page 9

Machine Learning vs Traditional Programming

Page 10

Page 11

Something wrong with HADOOP

Page 12

Hadoop is not SEXY

Page 13

Whaaaat?

Page 14

Map Reduce Job Writing

Page 15

MR code

Page 16

Hadoop Developers Right Now

Page 17

Iterative Calculations

10x – 100x

Pages 18-20

MapReduce vs Spark

Page 21

SPARK INTRO

Pages 22-26

Loading

val localData = Array(5, 7, 1, 12, 10, 25)

val ourFirstRDD = sc.parallelize(localData)

// from file
val textFile = sc.textFile("hdfs://...")

Pages 27-29

Word Count

val textFile = sc.textFile("hdfs://...")

val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")

Pages 30-33

SQL

val hive = new HiveContext(spark)

hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.sql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src")

val results = hive.sql("FROM src SELECT key, value").collect()

Page 34

RDD Factory

Pages 35-39

DStream

val conf = new SparkConf().setMaster("local[2]")
  .setAppName("NetworkWordCount")

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.print()

ssc.start()
ssc.awaitTermination()

Page 40

It’s very easy to understand Graph algorithms

def pregel[A](initialMsg: A,
              maxIterations: Int,
              activeDirection: EdgeDirection)
             (vprog: (VertexID, VD, A) => VD,
              sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID, A)],
              mergeMsg: (A, A) => A): Graph[VD, ED] = { <Your Pregel Computations> }
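To make the signature concrete, here is the classic single-source shortest-paths use of pregel, a sketch assuming a pre-built `graph: Graph[Double, Double]` (the source vertex id `1L` is illustrative):

```scala
import org.apache.spark.graphx._

// Assumption: `graph: Graph[Double, Double]` already exists; vertex 1L is the source.
val sourceId: VertexId = 1L

// Start with distance 0 at the source, infinity everywhere else
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vprog: keep the shorter distance
  triplet =>                                       // sendMsg: relax outgoing edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty,
  (a, b) => math.min(a, b)                         // mergeMsg: min over incoming messages
)
```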

Page 41

SPARK 2.0 DISCUSSION

Pages 42-43

Spark Family

Page 44

Case #0 : How to avoid DStreams with RDD-like API?

Page 45

Continuous Applications

Page 46

Continuous Applications cases

• Updating data that will be served in real time

• Extract, transform and load (ETL)

• Creating a real-time version of an existing batch job

• Online machine learning

Page 47

Write Batches

Page 48

Catch Streaming

Page 49

The main concept of Structured Streaming

You can express your streaming computation the same way you would express a batch computation on static data.

Page 50

Batch

// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")

// Transform with DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")

Page 51

Real Time

// Read JSON continuously from S3
val logsDF = spark.readStream.json("s3://logs")

// Transform with DataFrame API and save
logsDF.select("user", "url", "date")
  .writeStream.parquet("s3://out")
  .start()

Page 52

Unlimited Table

Pages 53-56

WordCount from Socket

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

Don’t forget to start Streaming
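The missing start step might look like this; a sketch that assumes a console sink and complete output mode just for demonstration:

```scala
// Assumption: wordCounts is the streaming Dataset built on the previous slides
val query = wordCounts.writeStream
  .outputMode("complete")   // re-emit the full updated counts on every trigger
  .format("console")        // demo sink: print batches to stdout
  .start()

query.awaitTermination()
```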

Page 57

WordCount with Structured Streaming

Page 58

Structured Streaming provides …

• fast & scalable

• fault-tolerant

• end-to-end with exactly-once semantics

• stream processing

• ability to use DataFrame/DataSet API for streaming

Page 59

Structured Streaming provides (in dreams) …

• fast & scalable

• fault-tolerant

• end-to-end with exactly-once semantics

• stream processing

• ability to use DataFrame/DataSet API for streaming

Page 60

Let’s UNION streaming and static DataSets

Page 61

Let’s UNION streaming and static DataSets

org.apache.spark.sql.AnalysisException:
Union between streaming and batch DataFrames/Datasets is not supported;

Page 62

Let’s UNION streaming and static DataSets

Go to UnsupportedOperationChecker.scala and check your operation

Page 63

Case #1 : We should think about optimization in RDD terms

Page 64

Single-thread collection

Page 65

No perf issues, right?

Page 66

The main concept

more partitions = more parallelism
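In code, the partition count is what drives the number of parallel tasks; a sketch (path and numbers are illustrative):

```scala
// Ask for at least 8 partitions when loading
val logs = sc.textFile("hdfs://host/logs", 8)
println(logs.getNumPartitions)

val wider = logs.repartition(64)    // full shuffle: more partitions, more parallel tasks
val narrower = wider.coalesce(16)   // shrink the partition count without a full shuffle
```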

Page 67

Do it in parallel

Page 68

I’d like NARROW

Page 69

Map, filter, filter

Page 70

GroupByKey, join
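The narrow/wide distinction above can be seen directly in a lineage; a small illustrative sketch:

```scala
val nums = sc.parallelize(1 to 1000, 8)

val narrow = nums.map(_ * 2).filter(_ % 3 == 0)        // narrow: partition-local, no shuffle
val wide   = narrow.map(n => (n % 10, n)).groupByKey() // wide: shuffle, new stage boundary

println(wide.toDebugString)  // the ShuffledRDD line marks where the stage splits
```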

Page 71

Case #2 : DataFrames suggest mixing SQL and Scala functions

Pages 72-78

History of Spark APIs

rdd.filter(_.age > 21)           // RDD
df.filter("age > 21")            // DataFrame SQL-style
df.filter(df.col("age").gt(21))  // Expression style
dataset.filter(_.age > 21)       // Dataset API

Page 79

Case #2 : A DataFrame refers to data attributes by name

Pages 80-81

DataSet = RDD’s types + DataFrame’s Catalyst

• RDD API

• compile-time type-safety

• off-heap storage mechanism

• performance benefits of the Catalyst query optimizer

• Tungsten

Page 82

Structured APIs in SPARK

Page 83

Unified API in Spark 2.0

DataFrame = Dataset[Row]

DataFrame is an untyped Dataset now (it still carries a schema, but no compile-time element type)
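Moving between the two views is a one-liner; a sketch (the Person case class and file path are illustrative):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Person(name: String, age: Long)

val df: DataFrame = spark.read.json("people.json")  // DataFrame, i.e. Dataset[Row]
val ds: Dataset[Person] = df.as[Person]             // attach a compile-time element type
val back: DataFrame = ds.toDF()                     // back to the untyped view
```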

Pages 84-86

Define case class, read JSON, filter by field

case class User(email: String, footSize: Long, name: String)

// DataFrame -> DataSet with Users
val userDS =
  spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()

userDS.rdd // IF YOU REALLY WANT

Page 87

Case #3 : Spark has many contexts

Page 88

Spark Session

• New entry point in Spark for creating Datasets

• Replaces SQLContext and HiveContext

• The move from SparkContext to SparkSession signifies the move away from RDDs

Page 89

Spark Session

val sparkSession = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()

val df = sparkSession.read
  .option("header", "true")
  .csv("src/main/resources/names.csv")

df.show()

Page 90

No, I want to create my lovely RDDs

Page 91

Where’s parallelize() method?

Page 92

RDD?

userDS.rdd // IF YOU REALLY WANT

Page 93

Case #4 : Spark uses Java serialization A LOT

Page 94

Two choices to distribute data across the cluster

• Java serialization
  By default, with ObjectOutputStream

• Kryo serialization
  You should register classes (no support for Serializable)
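Switching to Kryo is a configuration change; a sketch (the class name is illustrative):

```scala
import org.apache.spark.SparkConf

case class User(email: String, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registered classes are serialized by a compact id instead of the full class name
  .registerKryoClasses(Array(classOf[User]))
```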

Page 95

The main problem: the overhead of serializing

Each serialized object contains the class structure as well as the values

Page 96

The main problem: the overhead of serializing

Don’t forget about GC

Page 97

Tungsten Compact Encoding

Page 98

Encoder’s concept

Generate bytecode to interact with off-heap data
&
Give access to attributes without ser/deser
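Encoders are usually summoned implicitly via spark.implicits._, but they can be obtained explicitly; a sketch reusing the User case class from the earlier slides:

```scala
import org.apache.spark.sql.Encoders

case class User(email: String, footSize: Long, name: String)

val userEncoder = Encoders.product[User]  // encoder for a case class
println(userEncoder.schema)               // the schema Catalyst derived from the type
val longEncoder = Encoders.scalaLong      // encoder for a primitive
```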

Page 99

Encoders

Page 100

No custom encoders

Page 101

Case #5 : Not enough storage levels

Page 102

Caching in Spark

• Frequently used RDDs can be stored in memory

• One method, one shortcut: persist(), cache()

• SparkContext keeps track of cached RDDs

• Serialized or deserialized Java objects
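The method and its shortcut in use; a sketch (the path is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://host/logs")
logs.cache()  // shortcut for persist(StorageLevel.MEMORY_ONLY)

val errors = logs.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized blocks, spilled to disk if needed

// ...actions here reuse the cached data...
errors.unpersist()
```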

Page 103

Full list of options

• MEMORY_ONLY

• MEMORY_AND_DISK

• MEMORY_ONLY_SER

• MEMORY_AND_DISK_SER

• DISK_ONLY

• MEMORY_ONLY_2, MEMORY_AND_DISK_2

Page 104

Spark Core Storage Level

• MEMORY_ONLY (default for Spark Core)

• MEMORY_AND_DISK

• MEMORY_ONLY_SER

• MEMORY_AND_DISK_SER

• DISK_ONLY

• MEMORY_ONLY_2, MEMORY_AND_DISK_2

Page 105

Spark Streaming Storage Level

• MEMORY_ONLY (default for Spark Core)

• MEMORY_AND_DISK

• MEMORY_ONLY_SER (default for Spark Streaming)

• MEMORY_AND_DISK_SER

• DISK_ONLY

• MEMORY_ONLY_2, MEMORY_AND_DISK_2

Page 106

Developer API to make new Storage Levels

Page 107

What’s the most popular file format in BigData?

Page 108

Case #6 : External libraries to read CSV

Page 109

Easy to read CSV

val data = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/datasets/samples/users.csv")

data.cache()
data.createOrReplaceTempView("users")

display(data)

Page 110

Case #7 : How to measure Spark performance?

Page 111

You should measure performance!

Page 112

TPC-DS

99 Queries

http://bit.ly/2dObMsH

Page 113

Page 114

How to benchmark Spark

Page 115

Special Tool from Databricks

Benchmark Tool for SparkSQL

https://github.com/databricks/spark-sql-perf

Page 116

Spark 2 vs Spark 1.6

Page 117

Case #8 : What’s faster: SQL or DataSet API?

Page 118

Job Stages in old Spark

Page 119

Scheduler Optimizations

Page 120

Catalyst Optimizer for DataFrames

Page 121

Unified Logical Plan

DataFrame = Dataset[Row]

Page 122

Bytecode

Page 123

DataSet.explain()

== Physical Plan ==
Project [avg(price)#43, carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43, color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)

Page 124

Case #9 : Why does explain() show so many Tungsten things?

Page 125

How to be effective with CPU

• Runtime code generation

• Exploiting cache locality

• Off-heap memory management

Page 126

Tungsten’s goal

Push performance closer to the limits of modern hardware

Page 127

Maybe something UNSAFE?

Page 128

UnsafeRow format

• Bit set for tracking null values

• Small values are inlined

• Variable-length values are stored as a relative offset into the variable-length data section

• Rows are always 8-byte word aligned

• Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation

Page 129

Case #10 : Can I influence Memory Management in Spark?

Page 130

Case #11 : Should I tune the GC generations’ stuff?

Page 131

Cached Data

Page 132

During operations

Page 133

For your needs

Page 134

For Dark Lord

Page 135

IN CONCLUSION

Page 136

We have no ability to…

• join structured streaming and other sources to handle it

• use one unified ML API

• get a GraphX rethinking and redesign

• write custom encoders

• use Datasets everywhere

• integrate with something important

Page 137

Roadmap

• Support other data sources (not only S3 + HDFS)

• Transactional updates

• Dataset is one DSL for all operations

• GraphFrames + Structured MLLib

• Tungsten: custom encoders

• The RDD-based API is expected to be removed in Spark 3.0

Page 138

And we can DO IT!

(labels from the architecture diagram)

Real-Time Data-Marts
Batch Data-Marts
Relations Graph
Ontology Metadata
Search Index
Events & Alarms
Real-time Dashboarding
All Raw Data backup is stored here
Real-time Data Ingestion
Batch Data Ingestion
Real-Time ETL & CEP
Batch ETL & Raw Area
Scheduler
Internal / External / Social
HDFS → CFS as an option
Time-Series Data
Titan & KairosDB store data in Cassandra
Push Events & Alarms (Email, SNMP etc.)

Page 139

First Part

Page 140

Second Part

Page 141

Contacts

E-mail : [email protected]

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs

Page 142

Any questions?