Galois Performance
Mario Mendez-Lojo
Donald Nguyen
Overview
• Galois system is a test bed for exploring optimizations
  – Safe but not fast out of the box
• Important optimizations
  – Select least transactional overhead
  – Select right scheduling
  – Select appropriate data structure
• Quantify optimizations on applications
Algorithms
Irregular algorithms
• Topology: general graph, grid, tree
• Operator: morph, local computation, reader
• Ordering: unordered, ordered
Applications
1. Barnes-Hut
2. Delaunay Mesh Refinement
3. Preflow-push
Methodology
[Figure: per-thread timeline (threads vs. time) with regions Idle, Serial, GC, and Compute]
• Abort ratio: aborted iterations / total iterations
• GC options
  – UseParallelGC
  – UseParallelOldGC
  – NewRatio=1
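The abort ratio above can be sketched as a small calculation. This is only an illustration: the class and method names are hypothetical (not part of Galois), and it assumes that "total iterations" counts both committed and aborted iterations.

```java
// Hypothetical helper for illustration; not part of the Galois API.
final class AbortRatio {
    /**
     * Returns aborted / total, or 0.0 when no iterations were executed.
     * Assumption: total includes aborted iterations, so aborted <= total.
     */
    static double abortRatio(long abortedIterations, long totalIterations) {
        if (abortedIterations < 0 || abortedIterations > totalIterations) {
            throw new IllegalArgumentException("expected 0 <= aborted <= total");
        }
        return totalIterations == 0 ? 0.0 : (double) abortedIterations / totalIterations;
    }

    public static void main(String[] args) {
        // Illustrative counts only, not measured data.
        System.out.println(abortRatio(250, 1000)); // prints 0.25
    }
}
```

A ratio near 0 suggests little wasted speculative work; a ratio near 1 suggests most iterations were rolled back.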
Terms
• Base
  – Default scheduling, default graph
• Serial
  – Galois classes replaced by classes without concurrency control
• Speedup
  – Relative to the best mean performance of a serial variant
• Throughput
  – Number of serial iterations / time
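As a rough sketch of how the speedup and throughput terms could be computed, the following uses hypothetical helper names (not part of Galois). The speedup example reuses the Barnes-Hut times reported in this deck; the throughput inputs are made-up illustrative values.

```java
import java.util.Locale;

// Hypothetical helper class for illustration; not part of the Galois API.
final class Metrics {
    /** Speedup of a parallel run relative to the best serial variant. */
    static double speedup(double bestSerialMillis, double parallelMillis) {
        return bestSerialMillis / parallelMillis;
    }

    /** Throughput in iterations per millisecond. */
    static double throughput(long serialIterations, double elapsedMillis) {
        return serialIterations / elapsedMillis;
    }

    public static void main(String[] args) {
        // Barnes-Hut times from this deck: 10271 ms serial, 1553 ms parallel.
        System.out.printf(Locale.ROOT, "%.1fX%n", speedup(10271, 1553)); // prints 6.6X
        // Illustrative counts only, not measured data.
        System.out.println(throughput(1000, 500.0)); // prints 2.0
    }
}
```

Dividing by the serial iteration count, rather than by the iterations actually executed, means that aborted work does not inflate the throughput figure.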
Numbers
• Runtime
  – Last of 5 runs in the same VM
  – Ignores time to read and construct the initial graph
• Other statistics
  – Last of 5 runs
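A minimal sketch of this measurement protocol, assuming a hypothetical harness (the class, method, and parameter names are not from Galois): the setup step stands in for reading and constructing the initial graph and is left untimed, and only the last run's time is reported so that earlier runs serve as JVM warm-up.

```java
// Hypothetical timing harness for illustration; not part of the Galois API.
final class LastOfRuns {
    /**
     * Runs setup + workload `runs` times in the same JVM and returns the
     * elapsed time of the last workload run, in milliseconds.
     * Setup (e.g. reading and building the input graph) is not timed.
     */
    static long timeLastRunMillis(Runnable setup, Runnable workload, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        long lastMillis = 0;
        for (int i = 0; i < runs; i++) {
            setup.run();                       // untimed
            long start = System.nanoTime();
            workload.run();                    // timed
            lastMillis = (System.nanoTime() - start) / 1_000_000;
        }
        return lastMillis;
    }

    public static void main(String[] args) {
        // Placeholder workload; a real benchmark would run the application here.
        long ms = timeLastRunMillis(() -> { }, () -> Math.sqrt(42.0), 5);
        System.out.println("last run: " + ms + " ms");
    }
}
```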
Test Environment
• 2 x Xeon X5570 (4 cores, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20 GB heap size
BARNES-HUT
Most Distant Galaxy Candidates in the Hubble Ultra Deep Field
Barnes-Hut
• N-body algorithm
  – Oct-tree acceleration structure
  – Serial
    • Tree build, center of mass, particle update
  – Parallel
    • Force computation
• Structure
  – Reader on tree
• Variants
  – Splash2, Reader Galois
Reader Optimization
Before:
child = octree.getNeighbor(nn, 1);
After:
child = octree.getNeighbor(nn, 1, MethodFlag.NONE);
ParaMeter Profile
Barnes-Hut Results
100,000 points, 1 time step
Best serial: base
Serial time: 10271 ms
Best parallel time: 1553 ms
Best speedup: 6.6X
Barnes-Hut Scalability
DELAUNAY MESH REFINEMENT
Delaunay Mesh Refinement
• Refine “bad” triangles
  – Maintained in worklist
• Structure
  – Cautious operator on graph
• Variants
  – Flag optimized, locallifo
base: Priority.defaultOrder()
local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)
Cautious Optimization
Before:
mesh.contains(item);
...
mesh.remove(preNodes.get(i));
...
mesh.add(node);
After:
mesh.contains(item, MethodFlag.CHECK_CONFLICT);
...
mesh.remove(preNodes.get(i), MethodFlag.NONE);
...
mesh.add(node, MethodFlag.NONE);
• No need to save undo info
• Only check conflicts up to first write
LIFO Optimization
base:
GaloisRuntime.foreach(...,
    Priority.defaultOrder());
local lifo:
GaloisRuntime.foreach(...,
    Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
ParaMeter Profile
DMR Results
0.5M triangles, 0.25M bad triangles
Best serial: locallifo.flagopt
Serial time: 17002 ms
Best parallel time: 3745 ms
Best speedup: 4.5X
PREFLOW-PUSH
Preflow-push
• Max-flow algorithm
  – Nodes push flow downhill
• Structure
  – Cautious, local computation
• Variants
  – Flag optimized, local computation graph
base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
base (relabel): Priority.first(ChunkedFIFO.class, 8)
Local Computation Optimization
Before:
graph = ...
After:
graph = ...
b = new LocalComputationGraph.ObjectGraphBuilder();
graph = b.from(graph).create();
ParaMeter Profile
Preflow-push Results
From challenge problem (genmf-wide)
14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
http://avglab.com/andrew/CATS/maxflow_synthetic.htm
C: 11450 ms
Java: 30234 ms
Best serial: lc.flagopt
Serial time: 57121 ms
Best parallel time: 18242 ms
Best speedup: 3.1X
Preflow-push Scalability
What performance did we expect?
[Figure: per-thread timeline (threads vs. time) with regions Idle, Serial, GC, parallel (//) Compute, and Mis-speculation. Error (synchronization, …) is measured indirectly.]
What performance did we expect?
• Naïve: r(x) = t_1 / x
• Amdahl: r(x) = t_p / x + t_s
  – t_1 = t_p + t_s
  – t_s = t_idle + t_gc + t_serial
• Simple: r(x) = (t_p (i_x / i_1)) / x + t_s
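The three models can be compared with a short sketch. The class and method names are hypothetical, and the inputs are illustrative values rather than measurements. The reading of ix and i1 as the number of iterations executed with x threads and with one thread is an assumption; under that reading, ix / i1 scales the parallel work by the extra iterations caused by mis-speculation.

```java
// Hypothetical model calculator for illustration; not part of the Galois API.
final class ExpectedRuntime {
    /** Naive model: the whole single-thread time t1 scales with x threads. */
    static double naive(double t1, int x) {
        return t1 / x;
    }

    /** Amdahl model: only the parallel part tp scales; ts stays constant. */
    static double amdahl(double tp, double ts, int x) {
        return tp / x + ts;
    }

    /**
     * Simple model: Amdahl with the parallel part inflated by ix / i1.
     * Assumption: ix and i1 are iteration counts with x threads and 1 thread.
     */
    static double simple(double tp, double ts, double ix, double i1, int x) {
        return tp * (ix / i1) / x + ts;
    }

    public static void main(String[] args) {
        // Illustrative values only: t1 = 1000 ms, split into tp = 800 and ts = 200.
        System.out.println(naive(1000.0, 8));                        // prints 125.0
        System.out.println(amdahl(800.0, 200.0, 8));                 // prints 300.0
        System.out.println(simple(800.0, 200.0, 1250.0, 1000.0, 8)); // prints 325.0
    }
}
```

With these inputs, 25% extra iterations raise the predicted time at 8 threads from 300 ms to 325 ms, which shows how mis-speculation erodes the Amdahl estimate.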
Barnes-Hut
Delaunay Mesh Refinement
Preflow-push
Summary
• Many profitable optimizations
  – Selecting among method flags, worklists, graph variants
• Open topics
  – Automation
  – Static, dynamic and performance analysis
  – Efficient ordered algorithms