A Block-structured Heap Simplifies Parallel GC
Simon Marlow (Microsoft Research)Roshan James (U. Indiana)
Tim Harris (Microsoft Research)Simon Peyton Jones (Microsoft Research)
Problem Domain
• Stop the world and collect using multiple threads.
– We are not tackling the problem of GC running concurrently with program execution, for now.
– We are not tackling the problem of independent GC in a program running on multiple CPUs (but plan to later).
• Our existing GC is quite complex:
– Multi-generational
– Arbitrary aging per generation
– Eager promotion: promote an object early if it is referenced by an old generation
– Copying or compaction for the old generation (parallelise copying only for now)
– Typical allocation rate: 100MB–1GB/s
Background: copying collection

[Diagram: roots, allocation area, and to-space]
• Roots point to live objects.
• Copy live objects to to-space.
• Scan live objects for more roots.
• Complete when the scan pointer catches up with the allocation pointer.
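The single-threaded version of this loop can be sketched as follows. This is a minimal Cheney-style sketch: the object layout and the names (`Obj`, `evacuate`, `collect`, `TO_SPACE_CAP`) are illustrative, not GHC's actual runtime structures.

```c
#include <assert.h>
#include <stddef.h>

/* A minimal sketch of copying collection. Each object has at most two
 * pointer fields, for simplicity. */

#define TO_SPACE_CAP 1024

typedef struct Obj {
    struct Obj *forward;   /* forwarding pointer once copied */
    size_t nfields;
    struct Obj *fields[2];
} Obj;

static Obj to_space[TO_SPACE_CAP];
static size_t alloc_ptr;   /* next free slot in to-space */
static size_t scan_ptr;    /* next object to scan */

/* Copy obj to to-space, or return its existing copy. */
static Obj *evacuate(Obj *obj) {
    if (obj == NULL) return NULL;
    if (obj->forward) return obj->forward;   /* already copied */
    Obj *copy = &to_space[alloc_ptr++];
    *copy = *obj;
    copy->forward = NULL;
    obj->forward = copy;                     /* leave a forwarding pointer */
    return copy;
}

/* Collection is complete when scan_ptr catches up with alloc_ptr:
 * the to-space itself serves as the work queue. */
static void collect(Obj **roots, size_t nroots) {
    alloc_ptr = scan_ptr = 0;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = evacuate(roots[i]);
    while (scan_ptr < alloc_ptr) {
        Obj *obj = &to_space[scan_ptr++];
        for (size_t f = 0; f < obj->nfields; f++)
            obj->fields[f] = evacuate(obj->fields[f]);
    }
}
```

Note that unreachable objects are never touched at all: only live data is copied.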
How can we parallelise this?
• The main problem is finding an effective way to partition the work, so we can keep N CPUs busy all the time.
• Static partitioning (e.g. partition the heap by address) isn’t good:
– live data might not be evenly distributed
– need synchronisation when pointers cross partition boundaries
Work queues
• So typically, we need dynamic partitioning for GC:
– The available work (pointers to objects to be scanned) is kept on a queue.
– CPUs remove items from the queue, scan the object, and add more roots to the queue.
– e.g. Flood, Detlefs, Shavit, Zhang (2001)
• Good work partitioning, but:
– need separate work queues (in single-threaded GC, the to-space is the work queue)
– clever lock-free data structures
– extra administrative overhead
– some strategy for overflow (GC can’t use arbitrary extra memory!)
A block-structured heap
• Heap is divided into blocks, e.g. 4KB.
• Blocks can be linked together in lists.
• GC sits on top of a block allocator, which manages a free list of blocks.
• Each block has a “block descriptor”: a small data structure including the link field, which generation it belongs to, …
• Getting to the block descriptor from an arbitrary address is a pure function (~6 instructions).
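One way such a constant-time lookup can work is sketched below. The scheme assumed here (4KB blocks grouped into 1MB "megablocks", with the descriptors for all blocks in a megablock stored in an array at the start of that megablock) is one plausible layout; the constants, field names, and descriptor size are assumptions, and GHC's actual layout may differ.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SHIFT   12u    /* 4KB blocks (assumed)      */
#define MBLOCK_SHIFT  20u    /* 1MB megablocks (assumed)  */
#define MBLOCK_MASK   ((1ul << MBLOCK_SHIFT) - 1)

typedef struct bdescr {
    struct bdescr *link;     /* next block in a list */
    unsigned gen_no;         /* which generation the block belongs to */
    /* ... start/free pointers, flags, ... */
} bdescr;

#define BDESCR_SIZE_SHIFT 5u /* descriptor rounded to 32 bytes (assumed) */

/* A pure function of the address: just shifts, masks, and an add,
 * with no memory accesses needed to locate the descriptor. */
static uintptr_t bdescr_addr(uintptr_t p) {
    uintptr_t mblock  = p & ~MBLOCK_MASK;                  /* megablock base */
    uintptr_t blockno = (p & MBLOCK_MASK) >> BLOCK_SHIFT;  /* block index    */
    return mblock + (blockno << BDESCR_SIZE_SHIFT);        /* descriptor slot */
}
```

Any two addresses inside the same 4KB block map to the same descriptor slot, which is what makes per-block metadata (generation, list link) so cheap to reach.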
Block-structured heap
• Advantages:
– Memory can be recycled quickly: less wastage, better cache behaviour.
– Flexible: dynamic resizing of generations is easy.
– Large objects can be stored in their own blocks and managed separately.
Best of all…

• Since to-space is a list of blocks, it is an ideal work queue for parallel GC:
– No need for a separate work queue, and no extra admin overhead relative to single-threaded GC.
– ~4KB is large enough that contention for the global block queue should be low.
– ~4KB is small enough that we should still scale to large numbers of threads.
But what if…

• … there isn’t enough work to fill a block? E.g. if the heap consists of a single linked list of integers, the scan pointer will always be close to the allocation pointer, and we will never generate a full block of work.
– But then there isn’t much available parallelism anyway!
Available parallelism
• There’s enough parallelism, at least in old-gen collections.
The details…
• GHC’s heap is divided into generations.
• Each generation is divided into “steps” for aging.
• The last generation has only one step.
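The organisation just described can be captured in a few structs. This is only a sketch of the shape of the data; field names are illustrative, not GHC's actual RTS definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct block {          /* one ~4KB block, linked into a list */
    struct block *link;
} block;

typedef struct step {           /* one aging step within a generation */
    block *blocks;              /* list of blocks belonging to this step */
    struct generation *gen;     /* back-pointer to the owning generation */
} step;

typedef struct generation {
    step *steps;
    size_t n_steps;             /* the last generation has exactly one step */
} generation;
```

An object is aged by moving it from one step to the next within its generation, and promoted by moving it to an older generation.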
Queues per step

[Diagram: each (generation, step) pair, e.g. gen 0 step 1, has a work queue and a done queue; each thread (thread 0, thread 1, …) has a workspace for every step of generations 0, 1 and 2]
Inside a workspace…
• Objects copied to this step are allocated into the todo block (per-thread allocation!)
• Loop:
– Grab a block to be scanned from the work queue on a step.
– Scan it.
– Push it back to the “done” list on the step.
– When a todo block becomes full, move it to the global work queue for this step and grab an empty block.
[Diagram: a workspace’s scan block (with scan pointer) and todo block (with alloc pointer); regions marked free memory / not scanned / scanned]
Inside a workspace…
• When there are no full blocks of work left:
– Make the todo block the scan block.
– Scan until complete.
– Look for more full blocks…
• We want to avoid fragmentation: never flush a partially full block to the step unless absolutely necessary; keep it as the todo block.
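The workspace loop, including the fallback of scanning the todo block when no full blocks remain, can be sketched as below. This is a single-threaded model with one work queue and tiny blocks (so the flush path is exercised); the names and object layout are illustrative, not GHC's RTS.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_CAP 4          /* tiny blocks, to exercise the flush path */
#define MAX_BLOCKS 64

typedef struct Node { struct Node *kids[2]; int copied; } Node;
typedef struct Block { Node *objs[BLOCK_CAP]; size_t n; } Block;

static Block blocks[MAX_BLOCKS];
static size_t nblocks;

static Block *work_queue[MAX_BLOCKS];  /* full blocks awaiting scan */
static size_t work_n;
static Block *done_list[MAX_BLOCKS];   /* fully scanned blocks */
static size_t done_n;

static Block *fresh_block(void) { return &blocks[nblocks++]; }

/* "Evacuate" a child into the todo block; when the todo block fills
 * up, publish it on the work queue and start a fresh one. */
static Block *copy_into(Block *todo, Node *child) {
    if (child == NULL || child->copied) return todo;
    child->copied = 1;
    todo->objs[todo->n++] = child;
    if (todo->n == BLOCK_CAP) {        /* todo block full: publish it */
        work_queue[work_n++] = todo;
        todo = fresh_block();
    }
    return todo;
}

static void gc_loop(Node *root) {
    Block *todo = fresh_block();
    todo = copy_into(todo, root);
    for (;;) {
        Block *scan;
        if (work_n > 0) {
            scan = work_queue[--work_n]; /* grab a full block of work */
        } else if (todo->n > 0) {
            scan = todo;                 /* no full blocks: scan the todo block */
            todo = fresh_block();
        } else {
            break;                       /* no work left: done */
        }
        for (size_t i = 0; i < scan->n; i++)
            for (int k = 0; k < 2; k++)
                todo = copy_into(todo, scan->objs[i]->kids[k]);
        done_list[done_n++] = scan;      /* push scanned block to done list */
    }
}
```

In the parallel version, `work_queue` is the global per-step queue shared between threads, while `todo` and `done_list` stay thread-local, so allocation is contention-free.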
Termination

• When a thread finds no work, it increments a semaphore.
• If it finds the semaphore == number of threads, it exits.
• If there is work to do, decrement the semaphore and continue (don’t remove the work from the queue until the semaphore has been decremented).
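A sketch of this termination protocol, with an atomic idle count standing in for the semaphore. The work queue is reduced to an atomic counter of remaining items; all names and the thread count are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4

static atomic_int work_items;   /* items left on the "queue"       */
static atomic_int idle;         /* threads currently idle          */
static atomic_int scanned;      /* items processed (for checking)  */

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        /* Try to claim one item of work. */
        int w = atomic_load(&work_items);
        while (w > 0 &&
               !atomic_compare_exchange_weak(&work_items, &w, w - 1))
            ;                            /* CAS failed: w was reloaded */
        if (w > 0) {                     /* claimed an item: "scan" it */
            atomic_fetch_add(&scanned, 1);
            continue;
        }
        /* No work found: announce that we are idle. */
        atomic_fetch_add(&idle, 1);
        for (;;) {
            if (atomic_load(&idle) == NTHREADS)
                return NULL;             /* everyone idle: terminate */
            if (atomic_load(&work_items) > 0) {
                /* Work appeared: leave the idle state first, then go
                 * back and try to claim it, as the protocol requires. */
                atomic_fetch_sub(&idle, 1);
                break;
            }
        }
    }
}
```

The key invariant is that a thread only takes work after decrementing the idle count, so the count can only equal the number of threads when no work remains anywhere.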
Optimisations…
• Keep a list of “done” blocks per workspace, avoiding contention for global list. Concatenate them all at the end.
• Buffer the global work queue locally per workspace. A one block buffer is enough to reduce contention significantly.
• Some objects don’t need to be scanned, copy them to a separate non-scanned block (single-threaded GC already does this).
• Keep the thread-local state structure (workspaces) in a register.
Forwarding pointers
• Must synchronise if two threads attempt to copy the same object, otherwise the object is duplicated.
• Use CAS to install the forwarding pointer; if another thread installs the pointer first, return it (don’t copy the object). One CAS per object!
• CAS on a constructor not strictly necessary… just accept some duplication?
[Diagram: an object (header + payload) is copied into to-space; its original header is overwritten with a forwarding pointer (FWD) to the copy]
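The CAS race can be sketched as follows. The object layout is illustrative (in a real collector the forwarding pointer overwrites the header word rather than occupying its own field), and the allocation policy for the losing thread is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct Obj {
    _Atomic(struct Obj *) fwd;   /* NULL until the object is copied */
    int payload;
} Obj;

#define TO_SPACE_CAP 16
static Obj to_space[TO_SPACE_CAP];
static atomic_size_t alloc_idx;

/* Copy obj into to-space, racing other threads; exactly one copy wins. */
static Obj *evacuate(Obj *obj) {
    Obj *fwd = atomic_load(&obj->fwd);
    if (fwd != NULL) return fwd;             /* already copied */
    size_t i = atomic_fetch_add(&alloc_idx, 1);
    Obj *copy = &to_space[i];
    copy->payload = obj->payload;
    atomic_store(&copy->fwd, NULL);
    Obj *expected = NULL;
    if (atomic_compare_exchange_strong(&obj->fwd, &expected, copy))
        return copy;                         /* we installed the pointer */
    /* Another thread won the race: abandon our copy and use theirs.
     * (The wasted slot is the cost of losing; a real collector can
     * retract the speculative allocation.) */
    return expected;
}
```

Accepting duplication instead, as the last bullet suggests, would mean replacing the CAS with a plain store, which is safe only for immutable objects such as constructors, where either copy is equally valid.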
Status
• First prototype completed by Roshan James as an intern project this summer. Working multi-threaded, but speedup wasn’t quite what we hoped for (0% - 30% on 2 CPUs).
• Rewrite in progress, currently working single-threaded. Even with one CAS per object, only very slightly slower than existing single-threaded GC. I’m optimistic!
• We’re hooking up CPU performance counters to the runtime to see what’s really going on; I want to see if the cache behaviour can be tuned.
Further work

• Parallelise mark/compact too:
– No CAS required when marking (no forwarding pointers).
– Blocks make parallelising compaction easier: just statically partition the list of marked heap blocks, compact each segment, and concatenate the results.
• Independent minor GCs:
– Hard to parallelise minor GC: too quick, not enough parallelism.
– Stopping the world for minor GC is a severe bottleneck in a program running on multiple CPUs.
– So do per-CPU independent minor GCs.
– Main technical problem: either track or prevent inter-minor-generation pointers (e.g. Doligez/Leroy (1993) for ML, Steensgaard (2001)).
• Can we do concurrent GC?
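The static partitioning step proposed for parallel compaction can be sketched as below: split the list of marked blocks into one contiguous segment per thread, each of which can then be compacted independently. Names and the even-split policy are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 8

typedef struct blk { struct blk *link; } blk;

/* Divide a list of n marked blocks into at most nthreads segments, as
 * evenly as possible; seg[i] points at the first block of segment i.
 * Returns the number of segments produced. */
static size_t partition(blk *list, size_t n, size_t nthreads,
                        blk *seg[MAX_THREADS]) {
    size_t per = (n + nthreads - 1) / nthreads;  /* ceiling division */
    size_t nseg = 0, i = 0;
    for (blk *b = list; b != NULL; b = b->link, i++) {
        if (i % per == 0)
            seg[nseg++] = b;        /* every per-th block starts a segment */
    }
    return nseg;
}
```

Each thread then slides the live data within its own segment; because segments are disjoint runs of blocks, no synchronisation is needed during compaction, and the compacted segments are simply concatenated afterwards.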