Joker'16: Spark 2 (API Changes, Structured Streaming, Encoders)


Spark 2

Alexey Zinovyev, Java/Big Data Trainer at EPAM

About

With IT since 2007

With Java since 2009

With Hadoop since 2012

With EPAM since 2015


Contacts

E-mail : Alexey_Zinovyev@epam.com

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs


Sprk Dvlprs! Let’s start!


< SPARK 2.0


Modern Java in 2016

Big Data in 2014


Big Data in 2017


Machine Learning EVERYWHERE


Machine Learning vs Traditional Programming


Something wrong with HADOOP


Hadoop is not SEXY


Whaaaat?


Writing a MapReduce Job


MR code


Hadoop Developers Right Now


Iterative Calculations

10x – 100x


MapReduce vs Spark



SPARK INTRO

Loading

// from a local collection (a Seq, not a tuple, so that parallelize compiles)
val localData = Seq(5, 7, 1, 12, 10, 25)

val ourFirstRDD = sc.parallelize(localData)

// from a file
val textFile = sc.textFile("hdfs://...")

Word Count

val textFile = sc.textFile("hdfs://...")

val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")


SQL

val hive = new HiveContext(spark)

hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

hive.hql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src")

val results = hive.hql("FROM src SELECT key, value").collect()


RDD Factory


DStream

val conf = new SparkConf().setMaster("local[2]")

.setAppName("NetworkWordCount")

val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.print()

ssc.start()

ssc.awaitTermination()


It’s very easy to understand Graph algorithms

def pregel[A](initialMsg: A,
              maxIterations: Int,
              activeDirection: EdgeDirection)
             (vprog: (VertexId, VD, A) => VD,
              sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
              mergeMsg: (A, A) => A): Graph[VD, ED] = { <Your Pregel Computations> }
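For what goes into those three function slots, here is a sketch of the canonical single-source shortest path example from the GraphX docs, plugged into this signature (the tiny graph is illustrative):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// tiny hypothetical graph: edge attribute = distance
val edges: RDD[Edge[Double]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0)))
val graph = Graph.fromEdges(edges, defaultValue = 0.0)

val sourceId: VertexId = 1L
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),   // vprog
  triplet =>                                        // sendMsg
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  (a, b) => math.min(a, b))                         // mergeMsg

println(sssp.vertices.collect.mkString("\n"))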


SPARK 2.0 DISCUSSION


Spark Family



Case #0: How to avoid DStreams and their RDD-like API?


Continuous Applications


Continuous Applications cases

• Updating data that will be served in real time

• Extract, transform and load (ETL)

• Creating a real-time version of an existing batch job

• Online machine learning


Write Batches


Catch Streaming


The main concept of Structured Streaming

You can express your streaming computation the same way you would express a batch computation on static data.


Batch

// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")

// Transform with the DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")


Real Time

// Read JSON continuously from S3
val logsDF = spark.readStream.json("s3://logs")

// Transform with the DataFrame API and write continuously
// (DataStreamWriter has no parquet() shortcut; file sinks also
// require a checkpointLocation)
logsDF.select("user", "url", "date")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3://checkpoints")
  .start("s3://out")


Unbounded Table


WordCount from Socket

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()


Don't forget to start the streaming query!
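A minimal sketch of that missing step, following the Structured Streaming docs (the console sink and complete output mode are illustrative choices, not the deck's):

val query = wordCounts.writeStream
  .outputMode("complete") // re-emit the full counts table on every trigger
  .format("console")
  .start()

query.awaitTermination()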


WordCount with Structured Streaming


Structured Streaming provides …

• fast & scalable stream processing

• fault tolerance

• end-to-end exactly-once semantics

• the DataFrame/Dataset API for streaming


Structured Streaming provides (in dreams) …

• fast & scalable stream processing

• fault tolerance

• end-to-end exactly-once semantics

• the DataFrame/Dataset API for streaming


Let’s UNION streaming and static DataSets

org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;


Go to UnsupportedOperationChecker.scala and check whether your operation is supported.
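A hedged sketch of the kind of code that trips this check (paths and names are illustrative):

// a static (batch) DataFrame
val staticDF = spark.read.json("/tmp/users")

// a streaming DataFrame over the same schema
val streamDF = spark.readStream.schema(staticDF.schema).json("/tmp/stream")

// Spark 2.0 throws:
// org.apache.spark.sql.AnalysisException: Union between streaming
// and batch DataFrames/Datasets is not supported
val unioned = streamDF.union(staticDF)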


Case #1: We should think about optimization in RDD terms

Single-threaded collection

No perf issues, right?


The main concept

more partitions = more parallelism
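A small sketch of what that means in practice (sizes and counts are arbitrary):

val rdd = sc.parallelize(1 to 1000000, 4) // 4 partitions -> 4 parallel tasks
println(rdd.getNumPartitions)             // 4

val wider = rdd.repartition(16)  // full shuffle up to 16 partitions
val narrower = wider.coalesce(8) // shrink without a full shuffle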

Do it in parallel

I'd like NARROW


Map, filter, filter


GroupByKey, join
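To make the narrow/wide split concrete, a minimal sketch (my example, not a slide):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// narrow: each output partition depends on one input partition
val narrow = pairs.mapValues(_ * 2).filter(_._2 > 2)

// wide: groupByKey and join must shuffle data between partitions
val grouped = pairs.groupByKey()

// reduceByKey is usually preferable: it combines map-side before shuffling
val counts = pairs.reduceByKey(_ + _)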


Case #2: DataFrames suggest mixing SQL and Scala functions


History of Spark APIs


RDD vs SQL vs Expression vs Dataset

rdd.filter(_.age > 21) // RDD

df.filter("age > 21") // DataFrame, SQL-style

df.filter(df.col("age").gt(21)) // DataFrame, expression style

dataset.filter(_.age > 21) // Dataset API


Case #2: DataFrames refer to data attributes by name


DataSet = RDD’s types + DataFrame’s Catalyst

• RDD API

• compile-time type-safety

• off-heap storage mechanism

• performance benefits of the Catalyst query optimizer

• Tungsten


Structured APIs in SPARK


Unified API in Spark 2.0

DataFrame = Dataset[Row]

A DataFrame is an untyped Dataset of Rows now (it still has a schema, but no compile-time types)


Define a case class, read JSON, filter by field

case class User(email: String, footSize: Long, name: String)

import spark.implicits._

// DataFrame -> Dataset of Users
val userDS =
  spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()

userDS.filter(_.footSize > 38).collect()

userDS.rdd // IF YOU REALLY WANT


Case #3 : Spark has many contexts


Spark Session

• New entry point in Spark for creating Datasets

• Replaces SQLContext and HiveContext (and, with Structured Streaming, takes over the role of StreamingContext)

• The move from SparkContext to SparkSession signals a move away from RDDs


Spark

Session

val sparkSession = SparkSession.builder

.master("local")

.appName("spark session example")

.getOrCreate()

val df = sparkSession.read

.option("header","true")

.csv("src/main/resources/names.csv")

df.show()


No, I want to create my lovely RDDs


Where’s parallelize() method?
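It is still there, one hop away; a short sketch, assuming the sparkSession built on the previous slide:

// the SparkContext is reachable through the session
val sc = sparkSession.sparkContext
val rdd = sc.parallelize(Seq(5, 7, 1, 12, 10, 25))

// or stay in the new world and build a Dataset directly
import sparkSession.implicits._
val ds = Seq(5, 7, 1, 12, 10, 25).toDS()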


RDD?

userDS.rdd // IF YOU REALLY WANT


Case #4: Spark uses Java serialization A LOT


Two choices to distribute data across the cluster

• Java serialization: the default, via ObjectOutputStream

• Kryo serialization: you should register your classes (it does not rely on Serializable); see the registration sketch below
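A minimal registration sketch, assuming the User case class from the earlier slides:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registration keeps Kryo from writing full class names into every record
  .registerKryoClasses(Array(classOf[User]))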


The main problem: serialization overhead

Each serialized object contains the class structure as well as the values.

Don't forget about GC.


Tungsten Compact Encoding


Encoder’s concept

Generate bytecode to interact with off-heap data & give access to attributes without ser/deser.
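A short sketch of what that looks like at the API level, assuming the User case class from the earlier slides (LegacyUser is hypothetical):

import org.apache.spark.sql.{Encoder, Encoders}

// built-in encoders for primitives and case classes
val longEnc: Encoder[Long] = Encoders.scalaLong
val userEnc: Encoder[User] = Encoders.product[User]

// fallback for arbitrary classes: whole-object Kryo serialization,
// which gives up the "attribute access without deser" benefit
class LegacyUser(val name: String)
val legacyEnc: Encoder[LegacyUser] = Encoders.kryo[LegacyUser]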


Encoders


No custom encoders


Case #5: Not enough storage levels


Caching in Spark

• Frequently used RDDs can be stored in memory

• One method plus a shortcut: persist() and cache() (sketch below)

• SparkContext keeps track of cached RDDs

• Stored as serialized or deserialized Java objects
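A caching sketch under those rules (the path and filters are illustrative):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://...")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val errors = logs.filter(_.contains("ERROR")).cache()

// or pick a level explicitly (a level can be set only once per RDD)
val warnings = logs.filter(_.contains("WARN"))
  .persist(StorageLevel.MEMORY_AND_DISK_SER)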


Full list of options

• MEMORY_ONLY (default for Spark Core)

• MEMORY_AND_DISK

• MEMORY_ONLY_SER (default for Spark Streaming)

• MEMORY_AND_DISK_SER

• DISK_ONLY

• MEMORY_ONLY_2, MEMORY_AND_DISK_2


Developer API to make new Storage Levels


What’s the most popular file format in BigData?


Case #6: External libraries to read CSV


Easy to read CSV

val data = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/datasets/samples/users.csv")

data.cache()

data.createOrReplaceTempView("users")

display(data) // Databricks notebooks only


Case #7: How to measure Spark performance?


You'd like to measure performance!


TPC-DS: 99 queries

http://bit.ly/2dObMsH


How to benchmark Spark


Special Tool from Databricks

Benchmark Tool for SparkSQL

https://github.com/databricks/spark-sql-perf


Spark 2 vs Spark 1.6


Case #8: What's faster, SQL or the Dataset API?


Job Stages in old Spark


Scheduler Optimizations


Catalyst Optimizer for DataFrames


Unified Logical Plan

DataFrame = Dataset[Row]


Bytecode


DataSet.explain()

== Physical Plan ==
Project [avg(price)#43, carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43, color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)


Case #9: Why does explain() show so many Tungsten things?


How to be effective with CPU

• Runtime code generation (see the explain() sketch below)

• Exploiting cache locality

• Off-heap memory management
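The first bullet is visible directly in query plans; a tiny sketch (on Spark 2.x, code-generated operators are marked with a star; output shown roughly, abbreviated):

val df = spark.range(1000000).selectExpr("sum(id)")
df.explain()
// == Physical Plan ==
// *HashAggregate(keys=[], functions=[sum(id#0L)])
// +- Exchange SinglePartition
//    +- *HashAggregate(keys=[], functions=[partial_sum(id#0L)])
//       +- *Range (0, 1000000, ...)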


Tungsten’s goal

Push performance closer to the limits of modern hardware.


Maybe something UNSAFE?


UnsafeRowFormat

• Bit set for tracking null values

• Small values are inlined

• Variable-length values are stored as a relative offset into the variable-length data section

• Rows are always 8-byte word aligned

• Equality comparison and hashing can be performed on raw bytes without additional interpretation


Case #10: Can I influence memory management in Spark?
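Yes, within limits; a hedged sketch of the unified-memory knobs in Spark 2.x (the values shown are examples, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")        // execution + storage share of the heap
  .set("spark.memory.storageFraction", "0.5") // storage's protected share of that pool
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g")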


Case #11: Should I tune the JVM generations?

Cached Data

During operations

For your needs

For Dark Lord


IN CONCLUSION


We still have no …

• joins between Structured Streaming and other sources

• one unified ML API

• a rethought and redesigned GraphX

• custom encoders

• Datasets everywhere

• integration with something important


Roadmap

• Support other data sources (not only S3 + HDFS)

• Transactional updates

• Dataset is one DSL for all operations

• GraphFrames + Structured MLLib

• Tungsten: custom encoders

• The RDD-based API is expected to be removed in Spark 3.0


And we can DO IT!

[Architecture diagram: Batch and Real-time Data Ingestion from Internal, External and Social sources; Batch ETL & Raw Area (all raw-data backup is stored here; HDFS → CFS as an option); Real-Time ETL & CEP; Scheduler; Batch and Real-Time Data-Marts; Relations Graph, Ontology Metadata, Search Index and Time-Series Data (Titan & KairosDB store data in Cassandra); Events & Alarms pushed out (Email, SNMP etc.); Real-time Dashboarding.]


First Part


Second Part


Contacts

E-mail : Alexey_Zinovyev@epam.com

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs


Any questions?
