A Block-structured Heap Simplifies Parallel GC
Simon Marlow (Microsoft Research)Roshan James (U. Indiana)
Tim Harris (Microsoft Research)Simon Peyton Jones (Microsoft Research)
Problem Domain
• Stop the world and collect using multiple threads.
– We are not tackling the problem of GC running concurrently with program execution, for now.
– We are not tackling the problem of independent GC in a program running on multiple CPUs (but plan to later).
• Our existing GC is quite complex:
– Multi-generational
– Arbitrary aging per generation
– Eager promotion: promote an object early if it is referenced by an old generation
– Copying or compaction for the old generation (parallelise copying only for now)
– Typical allocation rate: 100MB–1GB/s
Background: copying collection

[Diagram: roots, allocation area, and to-space]
• Roots point to live objects.
• Copy live objects to to-space.
• Scan live objects for more roots.
• Complete when the scan pointer catches up with the allocation pointer.
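The single-threaded version of this loop can be sketched as follows. This is a minimal Cheney-style sketch: the object layout and the names (`Obj`, `evacuate`, `collect`, `TO_SPACE_CAP`) are illustrative, not GHC's actual runtime structures.

```c
#include <assert.h>
#include <stddef.h>

/* A minimal sketch of copying collection. Each object has at most two
 * pointer fields, for simplicity. */

#define TO_SPACE_CAP 1024

typedef struct Obj {
    struct Obj *forward;   /* forwarding pointer once copied */
    size_t nfields;
    struct Obj *fields[2];
} Obj;

static Obj to_space[TO_SPACE_CAP];
static size_t alloc_ptr;   /* next free slot in to-space */
static size_t scan_ptr;    /* next object to scan */

/* Copy obj to to-space, or return its existing copy. */
static Obj *evacuate(Obj *obj) {
    if (obj == NULL) return NULL;
    if (obj->forward) return obj->forward;   /* already copied */
    Obj *copy = &to_space[alloc_ptr++];
    *copy = *obj;
    copy->forward = NULL;
    obj->forward = copy;                     /* leave a forwarding pointer */
    return copy;
}

/* Collection is complete when scan_ptr catches up with alloc_ptr:
 * the to-space itself serves as the work queue. */
static void collect(Obj **roots, size_t nroots) {
    alloc_ptr = scan_ptr = 0;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = evacuate(roots[i]);
    while (scan_ptr < alloc_ptr) {
        Obj *obj = &to_space[scan_ptr++];
        for (size_t f = 0; f < obj->nfields; f++)
            obj->fields[f] = evacuate(obj->fields[f]);
    }
}
```

Note that unreachable objects are never touched at all: only live data is copied.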
How can we parallelise this?
• The main problem is finding an effective way to partition the work, so we can keep N CPUs busy all the time.
• Static partitioning (e.g. partition the heap by address) isn’t good:
– live data might not be evenly distributed
– need synchronisation when pointers cross partition boundaries
Work queues
• So typically, we need dynamic partitioning for GC:
– The available work (pointers to objects to be scanned) is kept on a queue.
– CPUs remove items from the queue, scan the object, and add more roots to the queue.
– e.g. Flood, Detlefs, Shavit, Zhang (2001)
• Good work partitioning, but:
– need separate work queues (in single-threaded GC, the to-space is the work queue)
– clever lock-free data structures
– extra administrative overhead
– some strategy for overflow (GC can’t use arbitrary extra memory!)
A block-structured heap
• Heap is divided into blocks, e.g. 4KB.
• Blocks can be linked together in lists.
• GC sits on top of a block allocator, which manages a free list of blocks.
• Each block has a “block descriptor”: a small data structure including the link field, which generation it belongs to, …
• Getting to the block descriptor from an arbitrary address is a pure function (~6 instructions).
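One way such a constant-time lookup can work is sketched below. The scheme assumed here (4KB blocks grouped into 1MB "megablocks", with the descriptors for all blocks in a megablock stored in an array at the start of that megablock) is one plausible layout; the constants, field names, and descriptor size are assumptions, and GHC's actual layout may differ.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SHIFT   12u    /* 4KB blocks (assumed)      */
#define MBLOCK_SHIFT  20u    /* 1MB megablocks (assumed)  */
#define MBLOCK_MASK   ((1ul << MBLOCK_SHIFT) - 1)

typedef struct bdescr {
    struct bdescr *link;     /* next block in a list */
    unsigned gen_no;         /* which generation the block belongs to */
    /* ... start/free pointers, flags, ... */
} bdescr;

#define BDESCR_SIZE_SHIFT 5u /* descriptor rounded to 32 bytes (assumed) */

/* A pure function of the address: just shifts, masks, and an add,
 * with no memory accesses needed to locate the descriptor. */
static uintptr_t bdescr_addr(uintptr_t p) {
    uintptr_t mblock  = p & ~MBLOCK_MASK;                  /* megablock base */
    uintptr_t blockno = (p & MBLOCK_MASK) >> BLOCK_SHIFT;  /* block index    */
    return mblock + (blockno << BDESCR_SIZE_SHIFT);        /* descriptor slot */
}
```

Any two addresses inside the same 4KB block map to the same descriptor slot, which is what makes per-block metadata (generation, list link) so cheap to reach.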
Block-structured heap
• Advantages:
– Memory can be recycled quickly: less wastage, better cache behaviour.
– Flexible: dynamic resizing of generations is easy.
– Large objects can be stored in their own blocks and managed separately.
Best of all…

• Since to-space is a list of blocks, it is an ideal work queue for parallel GC:
– No need for a separate work queue, and no extra admin overhead relative to single-threaded GC.
– ~4KB is large enough that contention for the global block queue should be low.
– ~4KB is small enough that we should still scale to large numbers of threads.
But what if…

• … there isn’t enough work to fill a block? E.g. if the heap consists of a single linked list of integers, the scan pointer will always be close to the allocation pointer, and we will never generate a full block of work.
– But then there isn’t much available parallelism anyway!
Available parallelism
• There’s enough parallelism, at least in old-gen collections.
The details…
• GHC’s heap is divided into generations.
• Each generation is divided into “steps” for aging.
• The last generation has only one step.
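The organisation just described can be captured in a few structs. This is only a sketch of the shape of the data; field names are illustrative, not GHC's actual RTS definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct block {          /* one ~4KB block, linked into a list */
    struct block *link;
} block;

typedef struct step {           /* one aging step within a generation */
    block *blocks;              /* list of blocks belonging to this step */
    struct generation *gen;     /* back-pointer to the owning generation */
} step;

typedef struct generation {
    step *steps;
    size_t n_steps;             /* the last generation has exactly one step */
} generation;
```

An object is aged by moving it from one step to the next within its generation, and promoted by moving it to an older generation.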
Queues per step

[Diagram: each (generation, step) pair, e.g. gen 0 step 1, has a work queue and a done queue; each thread (thread 0, thread 1, …) has a workspace for every step of generations 0, 1 and 2]
Inside a workspace…
• Objects copied to this step are allocated into the todo block (per-thread allocation!)
• Loop:
– Grab a block to be scanned from the work queue on a step.
– Scan it.
– Push it back to the “done” list on the step.
– When a todo block becomes full, move it to the global work queue for this step and grab an empty block.
[Diagram: a workspace’s scan block (with scan pointer) and todo block (with alloc pointer); regions marked free memory / not scanned / scanned]
Inside a workspace…
• When there are no full blocks of work left:
– Make the todo block the scan block.
– Scan until complete.
– Look for more full blocks…
• We want to avoid fragmentation: never flush a partially full block to the step unless absolutely necessary; keep it as the todo block.
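The workspace loop, including the fallback of scanning the todo block when no full blocks remain, can be sketched as below. This is a single-threaded model with one work queue and tiny blocks (so the flush path is exercised); the names and object layout are illustrative, not GHC's RTS.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_CAP 4          /* tiny blocks, to exercise the flush path */
#define MAX_BLOCKS 64

typedef struct Node { struct Node *kids[2]; int copied; } Node;
typedef struct Block { Node *objs[BLOCK_CAP]; size_t n; } Block;

static Block blocks[MAX_BLOCKS];
static size_t nblocks;

static Block *work_queue[MAX_BLOCKS];  /* full blocks awaiting scan */
static size_t work_n;
static Block *done_list[MAX_BLOCKS];   /* fully scanned blocks */
static size_t done_n;

static Block *fresh_block(void) { return &blocks[nblocks++]; }

/* "Evacuate" a child into the todo block; when the todo block fills
 * up, publish it on the work queue and start a fresh one. */
static Block *copy_into(Block *todo, Node *child) {
    if (child == NULL || child->copied) return todo;
    child->copied = 1;
    todo->objs[todo->n++] = child;
    if (todo->n == BLOCK_CAP) {        /* todo block full: publish it */
        work_queue[work_n++] = todo;
        todo = fresh_block();
    }
    return todo;
}

static void gc_loop(Node *root) {
    Block *todo = fresh_block();
    todo = copy_into(todo, root);
    for (;;) {
        Block *scan;
        if (work_n > 0) {
            scan = work_queue[--work_n]; /* grab a full block of work */
        } else if (todo->n > 0) {
            scan = todo;                 /* no full blocks: scan the todo block */
            todo = fresh_block();
        } else {
            break;                       /* no work left: done */
        }
        for (size_t i = 0; i < scan->n; i++)
            for (int k = 0; k < 2; k++)
                todo = copy_into(todo, scan->objs[i]->kids[k]);
        done_list[done_n++] = scan;      /* push scanned block to done list */
    }
}
```

In the parallel version, `work_queue` is the global per-step queue shared between threads, while `todo` and `done_list` stay thread-local, so allocation is contention-free.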
Termination

• When a thread finds no work, it increments a semaphore.
• If it finds the semaphore == number of threads, it exits.
• If there is work to do, decrement the semaphore and continue (don’t remove the work from the queue until the semaphore has been decremented).
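A sketch of this termination protocol, with an atomic idle count standing in for the semaphore. The work queue is reduced to an atomic counter of remaining items; all names and the thread count are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4

static atomic_int work_items;   /* items left on the "queue"       */
static atomic_int idle;         /* threads currently idle          */
static atomic_int scanned;      /* items processed (for checking)  */

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        /* Try to claim one item of work. */
        int w = atomic_load(&work_items);
        while (w > 0 &&
               !atomic_compare_exchange_weak(&work_items, &w, w - 1))
            ;                            /* CAS failed: w was reloaded */
        if (w > 0) {                     /* claimed an item: "scan" it */
            atomic_fetch_add(&scanned, 1);
            continue;
        }
        /* No work found: announce that we are idle. */
        atomic_fetch_add(&idle, 1);
        for (;;) {
            if (atomic_load(&idle) == NTHREADS)
                return NULL;             /* everyone idle: terminate */
            if (atomic_load(&work_items) > 0) {
                /* Work appeared: leave the idle state first, then go
                 * back and try to claim it, as the protocol requires. */
                atomic_fetch_sub(&idle, 1);
                break;
            }
        }
    }
}
```

The key invariant is that a thread only takes work after decrementing the idle count, so the count can only equal the number of threads when no work remains anywhere.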
Optimisations…
• Keep a list of “done” blocks per workspace, avoiding contention for global list. Concatenate them all at the end.
• Buffer the global work queue locally per workspace. A one block buffer is enough to reduce contention significantly.
• Some objects don’t need to be scanned, copy them to a separate non-scanned block (single-threaded GC already does this).
• Keep the thread-local state structure (workspaces) in a register.
Forwarding pointers
• Must synchronise if two threads attempt to copy the same object, otherwise the object is duplicated.
• Use CAS to install the forwarding pointer; if another thread installs the pointer first, return it (don’t copy the object). One CAS per object!
• CAS on a constructor not strictly necessary… just accept some duplication?
[Diagram: an object (header + payload) is copied into to-space; its original header is overwritten with a forwarding pointer (FWD) to the copy]
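The CAS race can be sketched as follows. The object layout is illustrative (in a real collector the forwarding pointer overwrites the header word rather than occupying its own field), and the allocation policy for the losing thread is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct Obj {
    _Atomic(struct Obj *) fwd;   /* NULL until the object is copied */
    int payload;
} Obj;

#define TO_SPACE_CAP 16
static Obj to_space[TO_SPACE_CAP];
static atomic_size_t alloc_idx;

/* Copy obj into to-space, racing other threads; exactly one copy wins. */
static Obj *evacuate(Obj *obj) {
    Obj *fwd = atomic_load(&obj->fwd);
    if (fwd != NULL) return fwd;             /* already copied */
    size_t i = atomic_fetch_add(&alloc_idx, 1);
    Obj *copy = &to_space[i];
    copy->payload = obj->payload;
    atomic_store(&copy->fwd, NULL);
    Obj *expected = NULL;
    if (atomic_compare_exchange_strong(&obj->fwd, &expected, copy))
        return copy;                         /* we installed the pointer */
    /* Another thread won the race: abandon our copy and use theirs.
     * (The wasted slot is the cost of losing; a real collector can
     * retract the speculative allocation.) */
    return expected;
}
```

Accepting duplication instead, as the last bullet suggests, would mean replacing the CAS with a plain store, which is safe only for immutable objects such as constructors, where either copy is equally valid.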
Status
• First prototype completed by Roshan James as an intern project this summer. Working multi-threaded, but speedup wasn’t quite what we hoped for (0% - 30% on 2 CPUs).
• Rewrite in progress, currently working single-threaded. Even with one CAS per object, only very slightly slower than existing single-threaded GC. I’m optimistic!
• We’re hooking up CPU performance counters to the runtime to see what’s really going on; I want to see if the cache behaviour can be tuned.
Further work

• Parallelise mark/compact too:
– No CAS required when marking (no forwarding pointers).
– Blocks make parallelising compaction easier: just statically partition the list of marked heap blocks, compact each segment, and concatenate the results.
• Independent minor GCs:
– Hard to parallelise minor GC: too quick, not enough parallelism.
– Stopping the world for minor GC is a severe bottleneck in a program running on multiple CPUs.
– So do per-CPU independent minor GCs.
– Main technical problem: either track or prevent inter-minor-generation pointers (e.g. Doligez/Leroy (1993) for ML, Steensgaard (2001)).
• Can we do concurrent GC?
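The static partitioning step proposed for parallel compaction can be sketched as below: split the list of marked blocks into one contiguous segment per thread, each of which can then be compacted independently. Names and the even-split policy are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 8

typedef struct blk { struct blk *link; } blk;

/* Divide a list of n marked blocks into at most nthreads segments, as
 * evenly as possible; seg[i] points at the first block of segment i.
 * Returns the number of segments produced. */
static size_t partition(blk *list, size_t n, size_t nthreads,
                        blk *seg[MAX_THREADS]) {
    size_t per = (n + nthreads - 1) / nthreads;  /* ceiling division */
    size_t nseg = 0, i = 0;
    for (blk *b = list; b != NULL; b = b->link, i++) {
        if (i % per == 0)
            seg[nseg++] = b;        /* every per-th block starts a segment */
    }
    return nseg;
}
```

Each thread then slides the live data within its own segment; because segments are disjoint runs of blocks, no synchronisation is needed during compaction, and the compacted segments are simply concatenated afterwards.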