eliminating read barriers through procrastination and cleanliness kc sivaramakrishnan lukasz ziarek...

Post on 14-Dec-2015

215 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Eliminating Read Barriers through Procrastination and Cleanliness

KC SivaramakrishnanLukasz Ziarek

Suresh Jagannathan

2

Big PictureLightweight user-level threads

Scheduler 1t1 t2 tn Lots of

concurrency!

Core 1 Core nCore 2

Heap

3

Big Picture

Expendable resource?

Big Picture

3

Scheduler 1t1 t2 tn Lots of

concurrency!

Heap

4

Big Picture

Expendable resource?

Big Picture

4

Scheduler 1

t1

t2 tn Lots of concurrency!

Heap

Exploit program concurrency to

eliminate read barriers from thread-local collectors

GC Operation

Alleviate MM cost?

5

MultiMLton• Goals– Safety, Scalability, ready for future manycore processors

• Parallel extension of MLton SML compiler and runtime

• Parallel extension of Concurrent ML– Lots of Concurrency!– Interact by sending messages over first-class channels

C

send (c, v)

v recv (c)

6

MultiMLton GC: Considerations• Standard ML – functional PL with side-effects– Most objects are small and ephemeral

• Independent generational GC

– # Mutations << # Reads• Keep cost of reads to be low

• Minimize NUMA effects• Run on non-cache coherent HW

7

MultiMLton GC: Design

Core

Local Heap

Core

Local Heap

Core

Local Heap

Core

Local Heap

Shared Heap

Thread-local GC

• NUMA Awareness• Circumvent cache-coherence issues

8

Invariant Preservation• Read and write barriers for preserving

invariants

Shared Heap

r

Local Heap

x

Target

Source

Exporting writes

r := x

Shared Heap

r

Local Heap

x

FWD

Transitive closure of x

Mutator needs read

barriers!

9

Challenge• Object reads are pervasive– RB overhead cost (RB) * frequency (RB)∝

• Read barrier optimization– Stacks and Registers never point to forwarded objects

20.1 %

15.3 %21.3 %

Mean Overhead----------------------

Read

bar

rier o

verh

ead

(%)

10

Mutator and Forwarded Objects

# RB invocations

# Encountered forwarded objects

< 0.00001

Eliminate read barriers altogether

11

RB Elimination• Visibility Invariant– Mutator does not encounter forwarded objects

• Observation– No forwarded objects created visibility invariant ⇒

No read barriers⇒

• Exploit concurrency Procrastination!

12

ProcrastinationShared Heap

r1

Local Heap

x1

T1 T2

r2

x2

r1 := x1 r2 := x2

T T is running

T T is suspended

T T is blocked

13

ProcrastinationShared Heap

r1

Local Heap

x1

r1 := x1

T1 T2

r2 := x2 r2

x2

Delayed write list

Control switches

to T2

T T is running

T T is suspended

T T is blocked

14

ProcrastinationShared Heap

r1

Local Heap

x1

T1 T2

r2

x2

Delayed write list

r1 := x1 r2 := x2

T T is running

T T is suspended

T T is blocked

15

ProcrastinationShared Heap

r1

Local Heap

T1 T2

r2 x2

Delayed write list

r1 := x1 r2 := x2x1

T T is running

T T is suspended

T T is blocked

FWD FWD

16

ProcrastinationShared Heap

r1

Local Heap

T1 T2

r2 x2

Delayed write list

x1

Force local GC

T T is running

T T is suspended

T T is blocked

r1 := x1 r2 := x2

17

Correctness• Does Procrastination introduce deadlocks?– Threads can be procrastinated while holding a lock!

T1 T2T2T T is running

T T is suspended

T T is blocked

18

Correctness

T1

• Is Procrastination safe?– Yes. Forcing a local GC unblocks the threads.– No deadlocks or livelocks!

T2T T is running

T T is suspended

T T is blocked

• Does Procrastination introduce deadlocks?– Threads can be procrastinated while holding a lock!

19

Correctness

T1 T2

• Does Procrastination introduce deadlocks?– Threads can be procrastinated while holding a lock!

T T is running

T T is suspended

T T is blocked• Is Procrastination safe?– Yes. Forcing a local GC unblocks the threads.– No deadlocks or livelocks!

20

• Efficacy (Procrastination) ∝ # Available runnable threads

Is Procrastination alone enough?

M

W1W1 W1

F

J

Serial (low thread availability)

Concurrent (high thread availability)

• With Procrastination, half of local major GCs were forced

Eager exporting writes while preserving visibility invariant

21

Cleanliness• A clean object closure can be lifted to the

shared heap without breaking the visibility invariant

r := xinSharedHeap (r)

inLocalHeap (x)&&

isClean (x)

Eager write (no Procrastination)

22

Cleanliness: Intuition

Shared Heap

Local Heap

x

lift (x) to shared heap

23

Shared Heap

Local Heap

Cleanliness: Intuition

x

FWD

find all references to FWD

24

Shared Heap

Local Heap

Cleanliness: Intuition

x Need to scan the entire local heap

25

Local Heap

h

Shared Heap

Cleanliness: Simpler question

x

FWD

Do all references originate from heap region h?

sizeof (h) << sizeof (local heap)

26

Local Heap

h

Shared Heap

Cleanliness: Simpler question

x Only scan theheap region h.

Heap session!

sizeof (h) << sizeof (local heap)

27

• Current session closed & new session opened– After an exporting write, a user-level context switch, a local

GC

Heap Sessions• Source of an exporting write is often– Young– rarely referenced from outside the closure

Previous Session Current Session FreeLocal Heap

SessionStart Frontier

Young Objects

Old Objects Start

28

• Current session closed & new session opened– After an exporting write, a user-level context switch, a local

GC– SessionStart is moved to Frontier

Heap Sessions• Source of an exporting write is often– Young– rarely referenced from outside the closure

• Average session size < 4KB

Previous Session FreeLocal Heap

Frontier & SessionStartStart

29

Cleanliness: Eager exporting writes• A clean object closure

– is fully contained within the current session– has no references from previous session

Previous SessionCurrent Session

FreeLocal Heap

X

Y Z

r := x

rShared Heap

30

Cleanliness: Eager exporting writes• A clean object closure

– is fully contained within the current session– has no references from previous session

Previous SessionCurrent Session

FreeLocal Heap

X

Y Z

r := x

rShared Heap

Walk and fix

FWD

31

Avoid tracing current session?• Many SML objects are tree-structured (List, Tree, etc,.)

– Specialize for no pointers from outside the closure

• ∀x’ transitive closure (x), ∊ref_count (x) = 0 && ref_count (x’) = 1

Local Heap

x(0)

y(1)

z(1)

• Eager exporting write– No current session tracing needed!

No refs from

outside

– ref_count does not consider pointers from stack or registers

32

Cleanliness: Reference Count

Current Session

X(0)

Current Session

X(1)

Current Session

X(LM)

Current Session

X(G)

PrevSess

Zero One LocalMany Global

• Purpose– Track pointers from previous session to current session– Identify tree-structured object

• Does not track pointers from stack and registers– Reference count only triggered during object initialization

and mutation

33

Bringing it all together• ∀x’ transitive closure (x), ∊

if max (ref_count (x’))– One & ref_count (x) = 0 Clean tree-structured ⇒ ⇒

Session tracing not needed– LocalMany Clean ⇒ ⇒ Trace current session– Global 1+ pointer from previous session ⇒ ⇒

Procrastinate

34

Cleanliness: Tree-structured Closure

Previous Session Current Session

x(0)

y(1)

z(1)T1Local Heap

Shared heap

r := x

r

current stack

35

Shared heap

Cleanliness: Tree-structured Closure

Previous Session Current Session

x

y

z

T1current

stackLocal Heap

r := x

FWD

r

Walk current

stack

36

Shared heap

Cleanliness: Tree-structured Closure

Previous Session Current Session

x

y

z

T1current

stackLocal Heap

r := x

r

No need to walk current

session!

37

Shared heap

Cleanliness: Tree-structured Closure

Previous Session Current Session

x

y

z

T1Local Heap

r := x

r

T2 Nextstack

FWDcurrent stack

38

Shared heap

Cleanliness: Tree-structured Closure

Previous Session Current Session

x

y

z

T1previous

stackLocal Heap

r := x

r

T2 currentstack

Context Switch

Walk target stack

39

Cleanliness: Object graph

Previous Session Current Session

x(0)

y(LM)

z(1)current

stackLocal Heap

Shared heap

r := x

r

a

40

Shared heap

Cleanliness: Object graph

Previous Session Current Session

x

y

z

current stack

Local Heap

r := x

r

a

FWD

FWD

Walk current

stack

Walk current session

41

Shared heap

Cleanliness: Object graph

Previous Session Current Session

x

y

z

current stack

Local Heap

r := x

r

a

Walk current

stack

Walk current session

42

Cleanliness: Global Reference

Previous Session Current Session

x(0)

y(1)

z(G)T1 current

stackLocal Heap

Shared heap

r := x

r

a

43

Cleanliness: Global Reference

Previous Session Current Session

x(0)

y(1)

z(G)T1 current

stackLocal Heap

Shared heap

r := x

r

a

Procrastinate

44

Immutable Objects• Specialize exporting writes• If immutable object in previous session– Copy to shared heap• Immutable objects in SML do not have identity

– Original object unmodified• Avoid space leaks– Treat large immutable objects as mutable

45

Cleanliness: Summary• Cleanliness allows eager exporting writes

while preserving visibility invariant• With Procrastination + Cleanliness, <1% of

local GCs were forced• Additional niceties– Completely dynamic Portable– Does not impose any restriction on the GC

strategy

46

Evaluation• Variants– RB- : TLC with Procrastination and Cleanliness – RB+ : TLC with read barriers

• Sansom’s dual-mode GC– Cheney’s 2-space copying collection Jonker’s sliding

mark-compacting– Generational, 2 generations, No aging

• Target Architectures: – 16-core AMD Opteron server (NUMA)– 48-core Intel SCC (non-cache coherent)– 864-core Azul Vega3

47

Results• Speedup: At 3X min heap size, RB- faster than

RB+– AMD 32% (2X faster than STW collector)– SCC 20%– AZUL 30%

• Concurrency– During exporting write, 8 runnable user-level

threads/core!

Cleanliness Impact• RB- MU- : RB- GC ignoring mutability for Cleanliness• RB- CL- : RB- GC ignoring Cleanliness (Only Procrastination)

48

Avg. slowdown--------------------

11.4%28.2%31.7%

49

Conclusion• Eliminate the need for read barriers by

preserving the visibility invariant– Procrastination: Exploit concurrency for delaying

exporting writes– Cleanliness: Exploit generational property for

eagerly perform exporting writes

50

Questions?

http://multimlton.cs.purdue.edu

top related