Apache Spark Performance: Past, Future and Present

Spark Performance Past, Future, and Present Kay Ousterhout Joint work with Christopher Canel, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun

Upload: databricks

Post on 21-Jan-2018


TRANSCRIPT

Page 1: Apache Spark Performance: Past, Future and Present

Spark Performance Past, Future, and Present

Kay Ousterhout Joint work with Christopher Canel, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun

Page 2:

About Me
Apache Spark PMC member
Recent PhD graduate from UC Berkeley; thesis on the performance of large-scale data analytics
Co-founder at Kelda (kelda.io)

Page 3:

How can I make this faster?


Page 5:

Should I use a different cloud instance type?

Page 6:

Should I trade more CPU for less I/O by using better compression?

Page 7:

How can I make this faster?

???


Page 10:

Major performance improvements possible via tuning, configuration

…if only you knew which knobs to turn

Page 11:

This talk:
Past: Performance instrumentation in Spark
Future: New architecture that provides performance clarity
Present: Improving Spark’s performance instrumentation

Page 12:

Example Spark Job (Word Count): split the input file into words and emit a count of 1 for each.

spark.textFile("hdfs://…") \
    .flatMap(lambda l: l.split(" ")) \
    .map(lambda w: (w, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .saveAsTextFile("hdfs://…")

Page 13:

Example Spark Job (Word Count): split the input file into words and emit a count of 1 for each; then, for each word, combine the counts and save the output.

spark.textFile("hdfs://…") \
    .flatMap(lambda l: l.split(" ")) \
    .map(lambda w: (w, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .saveAsTextFile("hdfs://…")
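As a hedged illustration, the same three transformations can be sketched in plain Python (no Spark required) to show what each step computes; `word_count` is an illustrative helper name, not a Spark API:

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split each line into words
    words = [w for line in lines for w in line.split(" ") if w]
    # map: emit (word, 1) for each word
    pairs = [(w, 1) for w in words]
    # reduceByKey: combine the counts per word with a + b
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same logic runs in parallel: the flatMap/map part on the map-stage tasks, and the count combining on the reduce-stage tasks.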

Page 14:

Spark Word Count Job:

Map Stage (split the input file into words and emit a count of 1 for each):
spark.textFile("hdfs://…")
    .flatMap(lambda l: l.split(" "))
    .map(lambda w: (w, 1))

Reduce Stage (for each word, combine the counts and save the output):
    .reduceByKey(lambda a, b: a + b)
    .saveAsTextFile("hdfs://…")

[Diagram: the map-stage and reduce-stage tasks each run on Workers 1 through n]

Page 15:

Spark Word Count Job, Reduce Stage (for each word, combine the counts and save the output):
    .reduceByKey(lambda a, b: a + b)
    .saveAsTextFile("hdfs://…")

[Diagram: reduce tasks running on Workers 1 through n]

Page 16:

What happens in a reduce task? (Timeline of compute, network, and disk; each block is the time to handle one shuffle block.)
(1) Request a few shuffle blocks
(2) Process local data
(3) Write output to disk
(4) Process data fetched remotely
(5) Continue fetching remote data

Page 17:

What happens in a reduce task? (Timeline of compute, network, and disk; each block is the time to handle one shuffle block.) The task is bottlenecked on different resources over time: first on network and disk, then on network alone, then on CPU.


Page 19:

What instrumentation exists today? Instrumentation is centered on the single, main task thread: Spark records the shuffle read blocked time, and the rest is reported as executor computing time (!), even while network and disk work continues in the background.

Page 20: Apache Spark Performance: Past, Future and Present

actual What instrumentation exists today?

timeline version

Page 21:

What instrumentation exists today? Instrumentation is centered on the single, main task thread. Shuffle read and shuffle write blocked times are recorded; input read and output write blocked times are not instrumented. Possible to add!
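One way the missing blocked-time counters could be added, as a sketch only (the class and counter names here are hypothetical, not Spark's actual metrics API): wrap each blocking read or write on the main task thread in a timer that accumulates into a per-task counter.

```python
import time
from contextlib import contextmanager

class TaskMetrics:
    """Hypothetical per-task blocked-time counters (illustrative names)."""
    def __init__(self):
        self.shuffle_read_blocked = 0.0
        self.output_write_blocked = 0.0

@contextmanager
def blocked(metrics, counter):
    # Accumulate wall-clock time the main task thread spends blocked on I/O.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        setattr(metrics, counter, getattr(metrics, counter) + elapsed)

metrics = TaskMetrics()
with blocked(metrics, "output_write_blocked"):
    time.sleep(0.02)  # stand-in for a blocking output write
print(metrics.output_write_blocked)
```

Because only the main thread is timed, this still misses background I/O, which is exactly the limitation the next slides discuss.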

Page 22:

Instrumenting read and write time: the simple picture, compute and then write each shuffle block to disk, is a lie. In reality, Spark processes and then writes one record at a time: most writes get buffered, and only occasionally is the buffer flushed to disk.

Page 23:

Spark processes and then writes one record at a time; most writes get buffered, and occasionally the buffer is flushed. Challenges with this reality: record-level instrumentation has too much overhead, and Spark doesn’t know when buffers get flushed (HDFS does!).
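A toy model of why per-record timing misleads (all names are illustrative): most `write()` calls only append to an in-memory buffer, and the disk cost is paid on the occasional flush, so timing individual writes mostly measures memory copies.

```python
class BufferedWriter:
    """Toy buffered writer: records accumulate in memory and reach
    'disk' only when the buffer fills or is explicitly flushed."""
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.buffer = bytearray()
        self.flush_count = 0
        self.bytes_to_disk = 0

    def write(self, record):
        self.buffer.extend(record)        # cheap: memory copy only
        if len(self.buffer) >= self.buffer_size:
            self.flush()                  # rare: the actual "disk" cost

    def flush(self):
        if self.buffer:
            self.bytes_to_disk += len(self.buffer)
            self.flush_count += 1
            self.buffer.clear()

w = BufferedWriter(buffer_size=64)
for _ in range(100):
    w.write(b"x" * 10)  # 100 writes, 1000 bytes total
w.flush()
print(w.flush_count)  # far fewer flushes than writes
```

Only the layer that owns the buffer (here, the writer; in Spark's case, HDFS) knows when the expensive flushes actually happen.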

Page 24:

Opportunities to improve instrumentation: tasks use fine-grained pipelining to parallelize across resources, and the instrumented times are blocked times only (the task is doing other things in the background).

Page 25:

This talk:
Past: Performance instrumentation in Spark
Future: New architecture that provides performance clarity
Present: Improving Spark’s performance instrumentation

Page 26:

[Figure: 4 concurrent tasks on a worker; tasks 1 through 8 over time]

Page 27:

[Figure: 4 concurrent tasks on a worker; tasks 1 through 8 over time]

Concurrent tasks may contend for the same resource (e.g., network).

Page 28:

What’s the bottleneck?

[Figure: tasks 1 through 8 over time]

At any time t, different tasks may be bottlenecked on different resources, and a single task may be bottlenecked on different resources at different times.

Page 29:

How much faster would my job be with 2x disk throughput?

[Figure: tasks 1 through 8 over time]

How would runtimes for these disk writes change? How would that change the timing of (and contention for) other resources?
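Such what-if questions can be sketched with a simplified bottleneck model (an assumption made here for illustration, not the talk's actual prediction code): if the total time each resource is busy is known, approximate job time by the busiest resource, then rescale one resource to model the hardware change.

```python
def predicted_runtime(resource_seconds, speedups=None):
    # Simplified model: the job is bound by its busiest resource.
    # resource_seconds: total seconds of work per resource (illustrative numbers)
    # speedups: e.g., {"disk": 2.0} models 2x disk throughput
    speedups = speedups or {}
    return max(t / speedups.get(r, 1.0) for r, t in resource_seconds.items())

baseline = {"cpu": 100.0, "disk": 120.0, "network": 60.0}
print(predicted_runtime(baseline))                 # disk-bound: 120.0
print(predicted_runtime(baseline, {"disk": 2.0}))  # 2x disk throughput: now CPU-bound, 100.0
```

The model only works if per-resource times are known in the first place, which is what today's multi-resource tasks make hard to measure.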

Page 30:

Today, tasks use pipelining to parallelize multiple resources. Proposal: build systems using monotasks that each consume just one resource.

Page 31:

Monotasks: each task uses exactly one resource (a network monotask, a compute monotask, or a disk monotask). A monotask doesn’t start until all of its dependencies complete.

[Figure: one of today's tasks (network read, CPU, disk write) decomposed into monotasks]

Page 32:

Dedicated schedulers control contention: a network scheduler, a CPU scheduler (1 monotask per core), and a disk drive scheduler (1 monotask per disk).

[Figure: monotasks for one of today's tasks flowing through the per-resource schedulers]
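A minimal sketch of the scheduling idea (class and function names are illustrative, not the monotasks implementation): each monotask names the one resource it uses, per-resource slot counts cap concurrency, and a monotask becomes runnable only once its dependencies have finished.

```python
class Monotask:
    def __init__(self, name, resource, deps=()):
        self.name = name
        self.resource = resource  # the single resource this task uses
        self.deps = list(deps)
        self.done = False

def run_monotasks(tasks, slots):
    """Round-based toy scheduler: per round, each resource runs at most
    slots[resource] monotasks whose dependencies have all completed.
    Assumes the dependency graph is acyclic."""
    order = []
    pending = list(tasks)
    while pending:
        used = {r: 0 for r in slots}
        ran = []
        for t in pending:
            if all(d.done for d in t.deps) and used[t.resource] < slots[t.resource]:
                used[t.resource] += 1
                ran.append(t)
        for t in ran:
            t.done = True
            order.append(t.name)
            pending.remove(t)
    return order

# One of today's tasks, decomposed: network read -> compute -> disk write
net = Monotask("net-read", "network")
cpu = Monotask("compute", "cpu", deps=[net])
dsk = Monotask("disk-write", "disk", deps=[cpu])
print(run_monotasks([net, cpu, dsk], {"network": 1, "cpu": 1, "disk": 1}))
# ['net-read', 'compute', 'disk-write']
```

Because every monotask touches one resource through one scheduler, per-resource timing falls out of the schedulers for free.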

Page 33:

Spark today: tasks have non-uniform resource use, and 4 multi-resource tasks run concurrently. Monotasks: single-resource monotasks are scheduled by per-resource schedulers. Monotasks are API-compatible and achieve performance parity with Spark, and performance telemetry becomes trivial!

Page 34:

How much faster would the job run with 4x more machines? With input stored in memory (no disk read, no CPU time to deserialize)? With flash drives instead of disks (faster shuffle read/write time)? A 10x improvement was predicted with at most 23% error.

Page 35:

Monotasks: Break jobs into single-resource tasks

Using single-resource monotasks provides clarity without sacrificing performance

Massive change to Spark internals (>20K lines of code)

Page 36:

This talk:
Past: Performance instrumentation in Spark
Future: New architecture that provides performance clarity
Present: Improving Spark’s performance instrumentation

Page 37:

Spark today: task resource use changes at fine time granularity, and 4 multi-resource tasks run concurrently. Monotasks: single-resource tasks lead to complete, trivial performance metrics. Can we get monotask-like per-resource metrics for each task today?

Page 38:

Can we get per-task resource use? Could we measure machine resource utilization? [Figure: compute, network, and disk timeline]

Page 39:

[Figure: 4 concurrent tasks on a worker; tasks 1 through 8 over time]

Page 40:

[Figure: 4 concurrent tasks on a worker; tasks 1 through 8 over time]

Concurrent tasks may contend for the same resource (e.g., network), and that contention is controlled by lower layers (e.g., the operating system).

Page 41:

Can we get per-task resource use? Machine utilization includes other tasks, and per-task I/O can’t be measured directly: it happens in the background, mixed with other tasks’ I/O.

Page 42:

Can we get per-task resource use? Existing metrics report the total data read, but how long did the reads take? Use machine utilization metrics to get bandwidth!

Page 43:

Existing per-task I/O counters (e.g., shuffle bytes read) + machine-level utilization (and bandwidth) metrics = complete metrics about the time spent using each resource.
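As a back-of-the-envelope sketch of that sum (the function and key names are illustrative): divide each per-task byte counter by the bandwidth derived from machine-level utilization to estimate the time the task spent on each resource.

```python
def per_resource_seconds(task_io_bytes, bandwidth_bytes_per_sec):
    # time on each resource ~= bytes moved / measured bandwidth
    return {r: task_io_bytes[r] / bandwidth_bytes_per_sec[r]
            for r in task_io_bytes}

print(per_resource_seconds(
    {"shuffle_read": 1_000_000_000, "disk_write": 500_000_000},  # per-task counters
    {"shuffle_read": 125_000_000, "disk_write": 100_000_000},    # from machine utilization
))  # {'shuffle_read': 8.0, 'disk_write': 5.0}
```

This assumes the measured bandwidth reflects what the task actually saw during its I/O, which is an approximation when tasks contend.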

Page 44:

Goal: provide performance clarity. The only way to improve performance is to know what to speed up. Why do we care about performance clarity? A typical performance evaluation is done by a group of experts; practical performance tuning is done by one novice.

Page 45:

Goal: provide performance clarity. The only way to improve performance is to know what to speed up. Some instrumentation exists already, focused on blocked times in the main task thread. There are many opportunities to improve instrumentation: (1) add read/write instrumentation at a lower level (e.g., HDFS), (2) add machine-level utilization info, and (3) calculate per-resource time.

More details at kayousterhout.org