TRANSCRIPT
A New Approach to Parallelising
Tracing Algorithms
Computer Science Department
University of Western Ontario
Computer Laboratory
University of Cambridge
Cosmin E. Oancea, Alan Mycroft & Stephen M. Watt
I. Motivation & High Level Goal
We study more scalable algorithms for parallel tracing: memory management is the primary motivation, but we do not claim immediate improvements to state-of-the-art GC.
Tracing is important to computing:
  sequential & flat memory model – well understood;
  parallel & multi-level memory – less clear: processor communication cost grows w.r.t. raw instruction speed x P x ILP.
Memory-centric algorithm for copy collection (a general form of tracing) – free of locks on the mainline path.
I. Abstract Tracing Algorithm
Assume an initialisation phase has already marked and processed some root nodes. Then:
1. mark and process any unmarked child of a marked node;
2. until no further marking is possible.
Implementing the implicit fix-point via worklists yields:
1. pick a node from a worklist;
2. if unmarked then mark it, process it, and add its unmarked children to worklists;
3. repeat until all worklists are empty.
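A minimal sequential sketch of the worklist formulation above (the Node class and trace method names are illustrative, not from the talk; a single worklist stands in for the per-processor ones):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class WorklistTrace {
    static class Node {
        final List<Node> children = new ArrayList<>();
        boolean marked = false;
    }

    // Returns the number of nodes marked, i.e. reachable from the roots.
    static int trace(List<Node> roots) {
        ArrayDeque<Node> worklist = new ArrayDeque<>(roots);
        int marked = 0;
        while (!worklist.isEmpty()) {        // 3. repeat until empty
            Node n = worklist.pop();         // 1. pick a node
            if (n.marked) continue;          // 2. if unmarked then...
            n.marked = true;                 //    ...mark and process it,
            marked++;
            for (Node c : n.children)        //    add unmarked children
                if (!c.marked) worklist.push(c);
        }
        return marked;
    }

    public static void main(String[] args) {
        Node a = new Node(), b = new Node(), c = new Node(), d = new Node();
        a.children.add(b); b.children.add(c); c.children.add(a); // cycle
        // d is allocated but unreachable from the root a.
        assert trace(List.of(a)) == 3;
    }
}
```

The unmarked-check at dequeue time is what makes duplicate worklist entries harmless, which matters once several processors feed the same lists.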
I. Worklist Semantics: Classical
What should worklists model?
Classical approach: processing semantics.
Worklist i stores nodes to be processed by processor i!
[Figure: Worklist 1, Worklist 2, Worklist 3, Worklist 4 – one per processor]
I. Classic Algorithm
Two layers of synchronisation:
  Worklist level – small overhead via deque (Arora et al.) or work stealing (Michael et al.).
  Frustrating atomic block – gives idempotent copy, thus enables the above small-overhead worklist-access solutions.

  while (!worklist.isEmpty()) {
    int ind = 0;
    Object from_child, to_child, to_obj = worklist.deqRand();
    foreach (from_child in to_obj.fields()) {
      ind++;
      atomic {
        if (from_child.isForwarded()) continue;
        to_child = copy(from_child);
        setForwardingPtr(from_child, to_child);
      }
      to_obj.setField(to_child, ind - 1);
      queue.enqueue(to_child);
    }
  }
I. Related Work
Halstead (MultiLisp) – first parallel semi-space collector, but may lead to load imbalance. Solutions:
  Object stealing: Arora et al., Flood et al., Endo et al., ...
  Block-based approaches: Imai and Tick, Attanasio et al., Marlow et al., ...
  Free-of-locks solutions via exploiting immutable data: Doligez and Leroy, Huelsbergen and Larus.
Memory-centric solutions – studied only in the sequential case: Shuf et al., Demers et al., Chicha and Watt.
II. Memory-Centric Tracing (High Level)
L == memory partition (local) size; gives the trade-off between locality of reference and load balancing.
Worklist j stores slots: the to-space address pointing to a from-space field f of the currently copied/scanned object o, where
  j = (o.f quo L) rem N
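A worked example of the dispatch formula j = (o.f quo L) rem N, reading quo as integer division and rem as modulo (the method name and the concrete L, N values are illustrative):

```java
public class SlotDispatch {
    // j = (fieldAddr quo L) rem N: which worklist a from-space field
    // address belongs to, given partition size L and N partitions.
    static int worklistFor(long fieldAddr, long L, int N) {
        return (int) ((fieldAddr / L) % N);
    }

    public static void main(String[] args) {
        long L = 64 * 1024;   // 64K partitions, as in the experiments
        int N = 8;            // illustrative partition count
        // Two fields inside the same 64K partition hit the same worklist:
        assert worklistFor(0x30000, L, N) == worklistFor(0x30008, L, N);
        // A field one full partition further along hits the next worklist:
        assert worklistFor(0x30000 + L, L, N)
                == (worklistFor(0x30000, L, N) + 1) % N;
    }
}
```

Grouping slots by the partition of the field they point to is what gives each collector locality of reference when it scans its owned partitions.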
II. Memory-Centric Tracing (High Level)
Arrow semantics (figure): double-ended – copy to-space; dashed – insert in queue; solid – slots pointing to fields.
1. Each worklist w is owned by at most one collector c (the owner).
2. Forwarded slots of c: those slots belonging to a partition owned by c, but discovered by another collector.
3. Eager strategy for acquiring worklist ownership. Initially all roots are placed in worklists; non-empty worklists are owned.
Dispatching Slots to Worklists or Forwarding Queues
II. Memory-Centric Tracing Implem.
Each collector:
  processes its forwarding queues (size F);
  releases ownership of empty worklists;
  processes F*P*4 items from its owned worklists (4 empirically chosen – inverse to the forwarding ratio).
No locking when accessing worklists or when copying.
L (local partition size) gives the locality-of-reference level.
Repeat until: no owned worklists && all forwarding queues empty && all worklists empty.
II. Forwarding Queues on INTEL IA-32
Implement inter-processor communication:
  with P collectors, have a PxP matrix of queues; entry (i,j) holds items enqueued by collector i and dequeued by collector j;
  wait-free, lock-free and mfence-free IA-32 implementation.

  volatile int tail = 0, head = 0, buff[F];
  // next : k -> (k+1) % F

  bool enq(Address slot) {
    int new_tl = next(tail);
    if (new_tl == head) return false;   // full: one slot kept free
    buff[tail] = slot;
    tail = new_tl;
    return true;
  }

  bool is_empty() { return head == tail; }

  Address deq() {
    Address slot = buff[head];
    head = next(head);
    return slot;
  }
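A Java transliteration of this single-producer/single-consumer ring buffer, as a runnable sketch (the class name and long-valued slots are assumptions; on the JVM, volatile gives stronger ordering than the fence-free IA-32 argument requires, so the head/tail reasoning carries over):

```java
public class ForwardQueue {
    private final long[] buff;                 // slots; capacity F-1 items
    private volatile int head = 0, tail = 0;   // consumer / producer cursors

    ForwardQueue(int F) { buff = new long[F]; }

    private int next(int k) { return (k + 1) % buff.length; }

    boolean isEmpty() { return head == tail; }

    boolean enq(long slot) {                   // called by the producer only
        int newTl = next(tail);
        if (newTl == head) return false;       // full: one slot kept free
        buff[tail] = slot;
        tail = newTl;                          // publish after the write
        return true;
    }

    long deq() {                               // called by the consumer only
        long slot = buff[head];
        head = next(head);
        return slot;
    }

    public static void main(String[] args) {
        ForwardQueue q = new ForwardQueue(4);  // holds at most 3 items
        assert q.enq(10) && q.enq(20) && q.enq(30);
        assert !q.enq(40);                     // full
        assert q.deq() == 10;
        assert !q.isEmpty();
    }
}
```

Keeping one slot free distinguishes full (next(tail) == head) from empty (head == tail) without a separate counter, which is what lets each cursor be written by exactly one side.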
II. Forwarding Queues on INTEL IA-32
The sequentially inconsistent pattern occurs, but the algorithm is still safe:
  head & tail interaction – reduces to a collector failing to deq from a non-empty list (and to enq into a non-full list);
  buff[tail_prev] & head==tail_prev interaction – safe because writes are not re-ordered.
  a = b = 0;                  // initially

  // Proc 1          // Proc 2
  a = 1;             b = 1;
  // mfence;         // mfence;
  x = b;             y = a;
  // Without the mfences, x == 0 && y == 0 is possible!

  // The analogous pattern here: (two enq) || (two is_empty; deq)
  // Proc i                      // Proc j
  buff[tail] = ...;              head = next(head);
  tail = ...;                    if (head != tail)
  if (new_tl == head) ...          ... = buff[head];
II. Dynamic Load Balancing
Small partitions (64K) – OK under static ownership:
  grey objects – randomly distributed among the N partitions, yet still giving some locality of reference (otherwise forwarding would be too expensive).
Larger partitions may need dynamic load balancing: partition ownership must be transferred:
  A starving collector c signals nearby collectors; these may release ownership of an owned worklist w while placing an item of w on collector c's forwarding queue.
  Partition stealing requires locking on the mainline path, since the copy operation is not idempotent without it (Michael et al.)!
II. Optimisation; Run-Time Adaptation
Inter-collector producer-consumer relations are detected when forwarding queues are found full (F*P*4 processed items/iteration): transfer ownership to the producer collector to optimise forwarding.
Run-time adaptation: monitor the forwarding ratio (FR) & load balancing (LB):
  start with large L;
  while LB is poor, decrease L;
  if FR > FR_MAX or L < L_MIN, switch to the classical algorithm!
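The adaptation policy above can be sketched as a pure decision function (the thresholds FR_MAX and L_MIN and the halving step are illustrative assumptions; the talk does not give concrete values):

```java
public class AdaptPolicy {
    static final double FR_MAX = 8.0;      // assumed forwarding-ratio cap
    static final long L_MIN = 16 * 1024;   // assumed minimum partition size

    // Returns the partition size to use next, or -1 to signal a switch
    // to the classical (processing-centric) algorithm.
    static long adapt(long L, double forwardingRatio, boolean poorBalance) {
        if (forwardingRatio > FR_MAX || L < L_MIN)
            return -1;                     // forwarding too costly: switch
        if (poorBalance)
            return L / 2;                  // trade locality for balance
        return L;                          // keep the current trade-off
    }

    public static void main(String[] args) {
        long L = 512 * 1024;               // start with large L
        assert adapt(L, 3.0, true) == 256 * 1024;  // poor LB: shrink L
        assert adapt(L, 9.0, false) == -1;         // FR too high: switch
    }
}
```

Shrinking L spreads grey objects over more partitions (better balance) at the cost of more cross-partition slots, which is exactly the FR the monitor watches.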
III. Empirical Results – Small Data
Two quad-core AMD Opteron machine on small live data-set applications against MMTk:
  Time average over Antlr, Bloat, Pmd, Xalan, Fop, Jython, Hsqldb.
  Heap Size = 120-200M, IFR average = 4.2, L = 64K.
[Chart] GC Time, Small Live Data Sets (normalised time; sequential time is 100):

  Processors         |     1 |     2 |     4 |     6
  Memory-Centric SD  | 139.7 |  92.4 |  59.9 |  49.6
  Classical SD       | 111.7 |  77.3 |  54.3 |  48.7
III. Empirical Results – Large Data
Two quad-core AMD Opteron machine on large live data-set applications against MMTk:
  Time average over Hsqldb, GCbench, Voronoi, TreeAdd, MST, TSP, Perimeter, BH.
  Heap Size > 500M, IFR average = 6.3, L = 128K.
[Chart] GC Time, Large Live Data Sets (normalised time; sequential time is 100):

  Processors         |     1 |     2 |     4 |     6 |     8
  Memory-Centric LD  | 131.0 |  78.5 |  40.5 |  27.6 |  23.1
  Classical LD       | 111.3 |  96.9 |  92.5 |  92.8 |  95.6
III. Empirical Results – Eclipse
Quad-core Intel machine on Eclipse (large live data-set):
  Heap Size = 500M, IFR average = (only) 2.6 for L = 512K, otherwise 2.1!
[Chart] GC Time, Eclipse Large (normalised time; sequential time is 100):

  Processors      |   1 |   2 |   3 |   4
  Memory-Centric  | 148 | 100 |  81 |  69
  Classical       | 116 |  69 |  57 |  48
III. Empirical Results – Jython
Two quad-core AMD machine on Jython:
  Heap Size = 200M, IFR average = (only) 3.0!
[Chart] GC Time, Jython (normalised time; sequential time is 100):

  Processors      |   1 |   2 |   4 |   6
  Memory-Centric  | 145 | 102 |  64 |  53
  Classical       | 108 |  70 |  58 |  44
III. Conclusions
Memory-centric algorithms may be an important alternative to processing-centric algorithms, especially on non-homogeneous hardware.
How to explicitly represent and optimise two abstractions: locality of reference (L) and inter-processor communication (FR). L trades off locality for load balancing.
Robust behaviour: scales well with both data size and number of processors.