deep dive: how spark uses memory -...

79
Deep Dive: How Spark Uses Memory Wenchen Fan 2017-5-19

Upload: others

Post on 20-May-2020

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Deep Dive: How Spark Uses Memory

Wenchen Fan 2017-5-19

Page 2: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Agenda

• Memory Usage Overview

• Memory Contention

• Tungsten Memory Format

• Cache-aware Computation

• Future Plans

Page 3: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will
Page 4: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Where Spark Uses Memory

• storage: memory used to cache data that will be used later.

(controlled by memory manager)

• execution: memory used for computation in shuffles, joins,

sorts and aggregations. (controlled by memory manager)

• others: user data structure, internal metadata, objects

created by UDF, etc.

Page 5: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Iterator

4, 3, 5, 1, 6,

2

Sort

1 2 3 4 5 6 Iterator

1, 2, 3, 4, 5,

6 take(3)

Execution

Memory

What if I want the sorted values again?

Page 6: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Iterator

4, 3, 5, 1, 6,

2

Sort

1 2 3 4 5 6 Iterator

1, 2, 3, 4, 5,

6 take(3)

Iterator

4, 3, 5, 1, 6,

2

Sort

1 2 3 4 5 6 Iterator

1, 2, 3, 4, 5,

6 take(4)

. . .

Page 7: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Iterator

4, 3, 5, 1, 6,

2

Sort

1 2 3 4 5 6 Iterator

1, 2, 3, 4, 5,

6

Execution

Memory

Cache

1 2 3 4 5 6

take(3) take(4) take(5) …

Storage

Memory

Page 8: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Memory Contention

• How to arbitrate memory between execution and storage?

• How to arbitrate memory across tasks running in parallel?

• How to arbitrate memory across operators running within the

same task?

Page 9: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Challenge #1

How to arbitrate memory between

execution and storage?

Page 10: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Easy, static assignment!

total available memory

execution storage

Page 11: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Easy, static assignment!

execution storage

Spill to

Disk

Page 12: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Easy, static assignment!

execution storage

Evict LRU blocks to

disk

Page 13: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Inefficient memory usage

leads to bad performance

Page 14: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Easy, static assignment!

execution storage

Execution can only use a fraction of the

memory, even when there is no storage!

Page 15: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Easy, static assignment!

execution storage

Efficient use of memory required user tuning

Page 16: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Unified Memory Management

execution storage

What happens if there is already

storage?

Page 17: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Unified Memory Management

execution storage

Evict LRU blocks to

disk

Page 18: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Unified Memory Management

execution storage

Page 19: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Design Considerations

• Why evict storage, not execution?

• Spilled execution data will always be read back from disk, where as

cached data may not.

• What if the application relies on cache?

Page 20: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Unified Memory Management

execution storage

This is bad!

Page 21: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Design Considerations

• Why evict storage, not execution?

• Spilled execution data will always be read back from disk, where as

cached data may not.

• What if the application relies on cache?

• allow users to specify a minimum unevictable amount of cached

data(not a reservation!)

Page 22: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Challenge #2

How to arbitrate memory across tasks

running in parallel?

Page 23: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Easy, static assignment!

Task 1

Worker machine has 4

cores Each task gets ¼ of the total

memory

Task 2 Task 3 Task 4

Page 24: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Dynamic Assignment

Task 1

The share of each task depends on the

number of actively running tasks

Page 25: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Dynamic Assignment

Task 1

The share of each task depends on the

number of actively running tasks

Page 26: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Dynamic Assignment

Task 1

Now another task comes along, the

first task have to spill to free up

memory

Page 27: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Dynamic Assignment

Task 1

Each task is now assigned ½ of the

total memory

Task 2

Page 28: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Dynamic Assignment

Each task is now assigned ¼ of the

total memory

Task 1 Task 2 Task 3 Task 4

Page 29: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Dynamic Assignment

Task 3

Last remaining task gets all the

memory

Page 30: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Static vs Dynamic Assignment

• Both are fair and starvation free

• Static Assignment is simpler

• Dynamic assignment handles stragglers better

Page 31: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Challenge #3

How to arbitrate memory across

operators running within the same task?

Page 32: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

SELECT age, AVG(height) FROM students GROUP BY age ORDER BY AVG(height)

students.groupBy("age") .avg("height") .orderBy("avg(height)") .collect() Scan

Project

Aggregate

Sort

Page 33: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Scan

Project

Aggregate

Sort

The task has 6

pages of memory

Page 34: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Scan

Project

Aggregate

Sort

Map { // age → (total,

count)

20 → (483, 3)

21 → (935, 5)

22 → (172, 1)

}

Page 35: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Scan

Project

Aggregate

Sort

All 6 pages were

used by

Aggregate,

leaving no

memory for Sort!

Page 36: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Scan

Project

Aggregate

Sort Solution #1

Reserve a page

for each operator

Page 37: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Scan

Project

Aggregate

Sort Solution #1

Reserve a page

for each operator

Starvation free, but still not fair…

What if there were more operators?

Page 38: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Scan

Project

Aggregate

Sort Solution #2

Cooperative spilling

Page 39: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Scan

Project

Aggregate

Sort Solution #2

Cooperative spilling

Page 40: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Scan

Project

Aggregate

Sort Solution #2

Cooperative spilling

Sort forces Aggregate to

spill a page to free

memory

Page 41: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Scan

Project

Aggregate

Sort Solution #2

Cooperative spilling

Sort needs more memory so

it forces Aggregate to spill

another page(and so on)

Page 42: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Scan

Project

Aggregate

Sort Solution #2

Cooperative spilling

Sort finishes with 3 pages

Aggregate does not have to

spill its remaining pages

Page 43: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Recap: three source of contention

How to arbitrate memory …

• between execution and storage?

• across tasks running in parallel?

• across operators running with the same task?

Instead of statically reserving memory in advance, deal

with memory contention when it raises by forcing

members to spill

Page 44: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

How Spark keep data in memory

• Put data as objects on the heap and operate on these

objects.

• Data caching is simply using a list to keep data objects.

Page 45: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Data objects? No!

• It is hard to monitor and control the memory usage when we have a lot of objects.

• Garbage collection will be the killer.

• Java objects has notable space overhead.

• High serialization cost when transfer data inside cluster.

Page 46: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Keep data as binary and operate on

binary

data

source

1010100001010 0010001001001 1010001000111 1010100100101

binary format

Cache Manager

Memory Manager

operators

allocate memor

y

memory pages allocate

memory memor

y pages

Page 47: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Java Objects Based Row Format

Array

BoxedInteger(123)

String(“data”)

String(“bricks”)

Row

• 5+ objects

• high space overhead

• slow value accessing

• expensive hashCode()

Page 48: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Efficient Binary Format

“bricks” 0x0 123 32 40 “data” 4 6

null tracking

bitmap

(123, “data”, “bricks”)

offset and length of

data

offset and length of

data

Page 49: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Efficient Binary Format

0x

0

12

3

2

4 4

“data

0x

0

12

3

2

4 4

“data

0x

0

1

6 2 “da”

123 >

0?

JSON

files

substring

Page 50: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

How to process binary

data more efficient?

Page 51: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Understanding CPU Cache

Memory is becoming

slower and slower

than CPU, we should

keep the frequently

accessed data in

CPU cache.

Page 52: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Understanding CPU Cache

Pre-fetch data into

CPU cache, with

cache line boundary.

Page 53: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

The most 2 important

techniques in big data

are ...

Sort and Hash!

Page 54: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Sort

record

record

record

pointer

pointer

pointer

Page 55: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Sort

record

record

record

pointer

pointer

pointer

Page 56: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Sort

record

record

record

pointer

pointer

pointer

Page 57: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Sort

record

record

record

pointer

pointer

pointer

Page 58: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Sort

Each comparison needs to access 2 different

memory regions, which makes it hard for CPU cache

to pre-fetch data, poor cache locality!

Page 59: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Cache-aware Sort

record

record

record

ptr

ptr

ptr

key-prefix

key-prefix

key-prefix

Page 60: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Cache-aware Sort

record

record

record

ptr

ptr

ptr

key-prefix

key-prefix

key-prefix

Page 61: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Cache-aware Sort

record

record

record

ptr

ptr

ptr

key-prefix

key-prefix

key-prefix

Page 62: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Cache-aware Sort

Most of the time, just go through the key-prefixes in a

linear fashion, good cache locality!

Page 63: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Hash Map

key pointer value

key pointer value

key pointer value

Page 64: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Hash Map

key pointer value

key pointer value

key pointer value

key

lookup

Page 65: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Hash Map

key pointer value

key pointer value

key pointer value

key

hash(key) %

size

Page 66: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Hash Map

key pointer value

key pointer value

key pointer value

key

compare these 2

keys

Page 67: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Hash Map

key pointer value

key pointer value

key pointer value

key

linear

probing

Page 68: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Hash Map

key pointer value

key pointer value

key pointer value

key

compare these 2

keys

Page 69: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Naive Hash Map

Each lookup needs many pointer dereferences and

key comparison when hash collision happens, and

jumps between 2 memory regions, bad cache locality!

Page 70: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

ptr

ptr

ptr

full hash

full hash

full hash

Cache-aware Hash Map

key value

key value

key value

Page 71: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

ptr

ptr

ptr

full hash

full hash

full hash

Cache-aware Hash Map

key value

key value

key value

key

lookup

Page 72: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

ptr

ptr

ptr

full hash

full hash

full hash

Cache-aware Hash Map

key value

key value

key value

key

hash(key) % size, and

compare the full hash

Page 73: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

ptr

ptr

ptr

full hash

full hash

full hash

Cache-aware Hash Map

key value

key value

key value

key

linear probing, and

compare the full hash

Page 74: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

ptr

ptr

ptr

full hash

full hash

full hash

Cache-aware Hash Map

key value

key value

key value

key

compare these 2

keys

Page 75: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Cache-aware Hash Map

Each lookup mostly only needs one pointer

dereference and key comparison(full hash collision is

rare), and access data in a single memory region,

better cache locality!

Page 76: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Recap: Cache-aware data structure

How to improve cache locality …

• store key-prefix with pointer.

• store key full hash with pointer.

Store extra information to try to keep the

memory accessing in a single region.

Page 77: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

What’s next

• Standard binary format, may use Apache Arrow.

• SPARK-19489

• SPARK-13534

• Columnar execution engine.

• SPARK-15687

Page 78: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

Thank You

Page 79: Deep Dive: How Spark Uses Memory - pic.huodongjia.compic.huodongjia.com/ganhuodocs/2017-06-15/1497513030.89.pdfWhere Spark Uses Memory •storage: memory used to cache data that will

We Are Hiring!!!

Send your resume to

[email protected]

Work at Hangzhou,

full time Spark developer!