Parallelizing Irregular Applications through the Exploitation of Amorphous Data-parallelism

Keshav Pingali (UT, Austin), Milind Kulkarni (Purdue)
Mario Mendez-Lojo (UT, Austin), Donald Nguyen (UT, Austin)

Objectives

• What are irregular applications?
  – how are they different from regular applications?
  – why are they so hard to parallelize?
• Is there parallelism in irregular applications?
  – operator formulation of algorithms
  – amorphous data-parallelism (ADP)
  – Galois programming model and baseline execution model
  – ParaMeter tool for estimating ADP in Galois programs
• Is there structure in irregular algorithms?
  – structural analysis of irregular algorithms
  – exploiting structure for efficient parallel implementation
  – Galois system

• What are the open research problems?

Schedule

• 1:30 PM-2:30 PM
  – Irregular applications, amorphous data parallelism, structure (Keshav Pingali)
• 2:30 PM-3:00 PM
  – ParaMeter (Milind Kulkarni)
• 3:00 PM-3:15 PM
  – Break
• 3:15 PM-4:15 PM
  – Galois programming model (Mario Mendez-Lojo)
• 4:15 PM-4:50 PM
  – Galois performance optimizations (Donald Nguyen)
• 4:50 PM-5:00 PM
  – Research problems (Keshav Pingali)


Irregular applications

Definitions

• Regular applications
  – key data structures are
    • vectors
    • dense matrices
  – simple access patterns
    • (eg) array indices are affine functions of for-loop indices
  – examples:
    • MMM, Cholesky & LU factorizations, stencil codes, FFT, …
• Irregular applications
  – key data structures are
    • lists, priority queues
    • trees, DAGs, graphs
    • usually implemented using pointers or references
  – complex access patterns
  – examples: see next slide

Examples

Application/domain         Algorithm
AI                         Bayesian inference, survey propagation
Compilers                  Iterative and elimination-based dataflow algorithms
Functional interpreters    Graph reduction, static and dynamic dataflow
Maxflow                    Preflow-push, augmenting paths
Computational biology      Protein homology networks, phylogenetic reconstruction
Event-driven simulation    Chandy-Misra-Bryant, Jefferson Timewarp
Meshing                    Generation/refinement/partitioning
Social networks            Connecting the dots
Data-mining                Clustering

Problem

• Handwritten parallel implementations exist for some irregular problems
• No systematic way of thinking about parallelism in irregular applications
  – no abstractions for thinking about parallelism
  – many mechanisms exist but it is not clear when these mechanisms are useful
    • static parallelization: dependence analysis and restructuring
    • inspector-executor parallelization
    • maximal independent set computation in interference graphs
    • thread-level speculation (TLS), transactional memory (TM)
    • …

Each new irregular application is a “new phenomenon” seemingly unrelated to other applications

• Need a science of parallel programming
  – analysis: framework for thinking about parallelism and locality in applications
  – synthesis: guide for producing efficient parallel implementations

Science of electro-magnetism

• Seemingly unrelated phenomena
• Unifying abstractions
• Specialized models that exploit structure

Organization of talk

• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – Graph reduction of functional languages
  – …
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure


Seemingly unrelated algorithms

Stencil computation: Jacobi iteration

• Finite-difference method for solving PDEs
  – discrete representation of domain: grid
• Values at interior points are updated using values at neighbors
  – values at boundary points are fixed
• Data structure:
  – dense arrays
• Parallelism:
  – values at next time step can be computed simultaneously
  – parallelism is not dependent on runtime values
• Compiler can find the parallelism
  – spatial loops are DO-ALL loops

// Jacobi iteration with 5-point stencil
// initialize array A
for time = 1, nsteps
  for <i,j> in [2,n-1]x[2,n-1]
    temp(i,j) = 0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
  for <i,j> in [2,n-1]x[2,n-1]
    A(i,j) = temp(i,j)

[Figure: Jacobi iteration, 5-point stencil; array A at time step t_n, array temp at t_{n+1}]
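A runnable version of the Jacobi pseudocode can make the DO-ALL property concrete. The sketch below is only an illustration in Python: it uses 0-based indices, a square grid stored as nested lists, and returns a fresh grid instead of copying temp back into A. The function names and the 4x4 test grid are assumptions, not part of the talk.

```python
def jacobi_step(a):
    """One Jacobi sweep with a 5-point stencil; boundary values stay fixed."""
    n = len(a)
    temp = [row[:] for row in a]      # every read below sees time step t_n
    for i in range(1, n - 1):         # iterations are independent: a DO-ALL loop
        for j in range(1, n - 1):
            temp[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1])
    return temp


def jacobi(a, nsteps):
    """Apply nsteps Jacobi sweeps and return the final grid."""
    for _ in range(nsteps):
        a = jacobi_step(a)
    return a


# Hypothetical input: boundary fixed at 1.0, interior starts at 0.0.
grid = [[1.0, 1.0, 1.0, 1.0],
        [1.0, 0.0, 0.0, 1.0],
        [1.0, 0.0, 0.0, 1.0],
        [1.0, 1.0, 1.0, 1.0]]
result = jacobi(grid, 50)
```

Because each interior point reads only values from the previous time step, the two inner loops could be split across threads in any way without changing the result.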

Delaunay Mesh Refinement

• Iterative refinement to remove badly shaped triangles:
    while there are bad triangles do {
      Pick a bad triangle;
      Find its cavity;
      Retriangulate cavity; // may create new bad triangles
    }
• Don’t-care non-determinism:
  – final mesh depends on order in which bad triangles are processed
  – applications do not care which mesh is produced
• Data structure:
  – graph in which nodes represent triangles and edges represent triangle adjacencies
• Parallelism:
  – bad triangles with cavities that do not overlap can be processed in parallel
  – parallelism is dependent on runtime values
    • compilers cannot find this parallelism
  – (Gary Miller et al) at runtime, repeatedly build interference graph and find maximal independent sets for parallel execution

Mesh m = /* read in mesh */
WorkList wl;
wl.add(m.badTriangles());
while (true) {
  if (wl.empty()) break;
  Element e = wl.get();
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);   // determine new cavity
  c.expand();
  c.retriangulate();          // re-triangulate region
  m.update(c);                // update mesh
  wl.add(c.badTriangles());
}
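The interference-graph approach mentioned on this slide can be sketched in a few lines. The Python below is a toy illustration, not the algorithm of Miller et al.: cavities are given as plain sets of triangle ids (hypothetical data), two activities interfere when their cavities share a triangle, and a greedy pass picks a maximal (not maximum) independent set.

```python
def interference_graph(neighborhoods):
    """Connect two activities whenever their neighborhoods share an element."""
    ids = sorted(neighborhoods)
    edges = {a: set() for a in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if neighborhoods[a] & neighborhoods[b]:
                edges[a].add(b)
                edges[b].add(a)
    return edges


def maximal_independent_set(edges):
    """Greedy maximal independent set, visiting nodes in sorted order."""
    chosen, blocked = [], set()
    for node in sorted(edges):
        if node not in blocked:
            chosen.append(node)
            blocked |= edges[node]   # neighbors can no longer be chosen
    return chosen


# Hypothetical cavities: bad triangle -> set of triangle ids in its cavity.
cavities = {
    "t1": {1, 2, 3},
    "t2": {3, 4},
    "t3": {5, 6},
    "t4": {6, 7},
}
graph = interference_graph(cavities)
batch = maximal_independent_set(graph)   # activities safe to run together
```

In a real refinement loop the graph would have to be rebuilt after each batch, since retriangulation changes the mesh and may create new bad triangles.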

Event-driven simulation

• Stations communicate by sending messages with time-stamps on FIFO channels
• Stations have internal state that is updated when a message is processed
• Messages must be processed in time-order at each station
• Data structure:
  – Messages in event-queue, sorted in time-order
• Parallelism:
  – activities created in future may interfere with current activities, so the interference graph approach will not work
  – Jefferson time-warp
    • station can fire when it has an incoming message on any edge
    • requires roll-back if speculative conflict is detected
  – Chandy-Misra-Bryant
    • conservative event-driven simulation
    • requires null messages to avoid deadlock

[Figure: stations A, B and C exchanging time-stamped messages]
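A sequential reference implementation shows why ordering matters here. The sketch below is a minimal Python event loop built on the standard `heapq` module; the three stations, their handlers and the delays are hypothetical and chosen only to show a newly created event overtaking one that is scheduled later.

```python
import heapq


def simulate(initial_events, handlers):
    """Process events in global time order; handlers may schedule later events."""
    queue = list(initial_events)          # entries are (time, station, payload)
    heapq.heapify(queue)
    log = []
    while queue:
        time, station, payload = heapq.heappop(queue)
        log.append((time, station, payload))
        for delay, target, new_payload in handlers[station](payload):
            heapq.heappush(queue, (time + delay, target, new_payload))
    return log


# Hypothetical network: A and B forward to C, and C is a sink.
handlers = {
    "A": lambda msg: [(3, "C", msg + "->A")],
    "B": lambda msg: [(1, "C", msg + "->B")],
    "C": lambda msg: [],
}
log = simulate([(2, "A", "m1"), (5, "B", "m2")], handlers)
```

The message that A sends to C at time 5 did not exist when the simulation started, yet it must be handled before the message B sends at time 6. A scheduler that only inspects the current set of activities cannot see that constraint.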

Remarks

• Diverse algorithms and data structures
  – relatively few algorithms use dense arrays
  – more common: graphs, trees, lists, priority queues, …
• Exploiting parallelism is very complex
  – DMR: (Miller et al) interference graph + maximal independent sets
  – Event-driven simulation: (Jefferson) Timewarp algorithm
• Complexity arises from several sources:
  – parallelism can be dependent on runtime values
    • DMR, event-driven simulation, graph reduction, …
  – don’t-care non-determinism
    • nothing to do with concurrency
    • DMR, graph reduction
  – activities created in the future may interfere with current activities
    • event-driven simulation, …


Unifying abstractions

• Should provide a model of parallelism in irregular algorithms
• Ideally, unified treatment of parallelism in regular and irregular algorithms
  – parallelism in regular algorithms should emerge as a special case of general model
  – (cf.) correspondence principles in Physics
• Abstractions should be effective
  – should be possible to write an interpreter to execute algorithms in parallel

Traditional abstraction

• Dependence graph
  – nodes are computations
  – edges are dependences
• Parallelism
  – width of the computation graph
  – critical path: longest path from START to END
• Effective parallel computation graph model
  – dataflow model of Dennis, Arvind et al
• Inadequate for irregular applications
  – dependences between computations are functions of runtime values
  – don’t-care non-determinism
  – conflicting work may be created dynamically
  – …
• Data structures play almost no role in this abstraction
  – in most programs, parallelism comes from data-parallelism (concurrent operations on data structure elements)
• New abstraction
  – data-centric: data structures play a central role
  – we will use graph ADT to illustrate concepts
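For a dependence graph that is known up front, the two parallelism measures on this slide are easy to compute. The Python sketch below assumes unit-cost computations and a small hypothetical DAG; it uses the number of nodes at the same depth as a simple proxy for width, which is an assumption rather than the talk's definition.

```python
def depths(deps):
    """Depth of each node: 1 for sources, else 1 + the deepest predecessor."""
    memo = {}

    def depth(node):
        if node not in memo:
            memo[node] = 1 + max((depth(p) for p in deps[node]), default=0)
        return memo[node]

    return {node: depth(node) for node in deps}


# Hypothetical dependence graph: node -> computations it depends on.
deps = {
    "START": [],
    "a": ["START"], "b": ["START"], "c": ["START"],
    "d": ["a", "b"],
    "END": ["c", "d"],
}
level = depths(deps)
critical_path = max(level.values())                 # START -> a -> d -> END
width = max(list(level.values()).count(k) for k in set(level.values()))
average_parallelism = len(deps) / critical_path     # unit-cost nodes assumed
```

This calculation needs the whole graph before execution starts, which is exactly what irregular algorithms cannot provide when dependences are functions of runtime values.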

Graph ADT

• Parallel algorithms are usually introduced using matrices and matrix algorithms
• More useful abstraction for us: graph
  – abstract data type (ADT):
    • some number of nodes and edges
    • edges may be directed
    • nodes/edges may be labeled with values
  – concrete representation
    • may be different for different applications, platforms and graphs
  – application program written in terms of ADT
    • (cf.) relational databases
• Graphs are usually sparse in most applications
  – even Jacobi example: 2D grid is a sparse graph in which each internal node has four neighbors!
• Special graphs
  – cliques: isomorphic to square dense matrices
  – labeled graphs w/o edges: isomorphic to sets/multi-sets

[Figure: graph ADT and its concrete representation]
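The split between ADT and concrete representation can be illustrated with one interface and two implementations. This is a hedged Python sketch: the class names, the two-method interface and the choice of adjacency lists versus compressed sparse row (CSR) are all assumptions for illustration, not the Galois library's API.

```python
from abc import ABC, abstractmethod


class Graph(ABC):
    """Graph ADT: the only interface the application program sees."""

    @abstractmethod
    def nodes(self): ...

    @abstractmethod
    def neighbors(self, node): ...


class AdjacencyListGraph(Graph):
    def __init__(self, edges):
        self._adj = {}
        for u, v in edges:
            self._adj.setdefault(u, []).append(v)
            self._adj.setdefault(v, [])

    def nodes(self):
        return sorted(self._adj)

    def neighbors(self, node):
        return sorted(self._adj[node])


class CSRGraph(Graph):
    """Compressed sparse row: nodes are 0..n-1, edges packed into two arrays."""

    def __init__(self, n, edges):
        ordered = sorted(edges)
        self._n = n
        self._targets = [v for _, v in ordered]
        self._offsets = [0] * (n + 1)
        for u, _ in ordered:
            self._offsets[u + 1] += 1          # count out-edges per node
        for i in range(n):
            self._offsets[i + 1] += self._offsets[i]   # prefix sum

    def nodes(self):
        return list(range(self._n))

    def neighbors(self, node):
        return self._targets[self._offsets[node]:self._offsets[node + 1]]


def out_degree_sum(graph):
    """Application code written purely against the ADT."""
    return sum(len(graph.neighbors(n)) for n in graph.nodes())


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
a = AdjacencyListGraph(edges)
b = CSRGraph(4, edges)
```

`out_degree_sum` runs unchanged on either representation, which is the property that lets the concrete layout be chosen per application, platform or graph.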

Operator formulation of algorithms

• Algorithm = repeated application of operator to graph
  – active element:
    • node or edge where computation is needed
      – DMR: nodes representing bad triangles
      – Event-driven simulation: station with incoming message
    • activity: applying operator to active element
  – neighborhood:
    • set of nodes and edges read/written to perform computation
      – DMR: cavity of bad triangle
      – Event-driven simulation: station
    • usually distinct from neighbors in graph
  – ordering:
    • order in which active elements must be executed in a sequential implementation
      – any order (Jacobi, DMR, graph reduction)
        » unordered algorithms
      – some problem-dependent order (event-driven simulation)
        » ordered algorithms

[Figure: graph with active nodes i1 to i5; legend marks active nodes and neighborhoods]

Parallelism

• Amorphous data-parallelism (ADP)
  – Active nodes can be processed in parallel, subject to
    • neighborhood constraints
    • ordering constraints
• How do we make this model effective?
• Unordered algorithms
  – Activities are independent if their neighborhoods do not overlap
  – Independent activities can be processed in parallel
  – More generally,
    • Elements in the intersection of neighborhoods are not written
    • Commutativity relations
  – How do we find independent activities?
• Ordered algorithms
  – Independence is not enough
  – How do we determine what is safe to execute w/o violating ordering?
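The two independence conditions for unordered algorithms can be written as set tests. In the Python sketch below each activity is a dictionary with hypothetical `reads` and `writes` sets; that representation is an assumption made for the example, and commutativity relations are not modeled.

```python
def independent(a, b):
    """Strict test: neighborhoods (reads plus writes) must not overlap."""
    return not ((a["reads"] | a["writes"]) & (b["reads"] | b["writes"]))


def independent_relaxed(a, b):
    """Relaxed test: overlap is allowed as long as no shared element is written."""
    shared = (a["reads"] | a["writes"]) & (b["reads"] | b["writes"])
    return not (shared & (a["writes"] | b["writes"]))


# Hypothetical activities over graph nodes n1..n3.
act1 = {"reads": {"n1", "n2"}, "writes": {"n1"}}
act2 = {"reads": {"n2", "n3"}, "writes": {"n3"}}
act3 = {"reads": {"n1"}, "writes": set()}
```

Here `act1` and `act2` both read `n2`, so the strict test rejects them while the relaxed test accepts them. `act1` and `act3` fail both tests because `n1` is written by one and read by the other.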


Galois programming model (PLDI 2007)

• Program written in terms of abstractions in model
• Two-level programming model
• Joe programmers
  – sequential, OO model
  – Galois set iterators: for iterating over unordered and ordered sets of active elements
    • for each e in Set S do B(e)
      – evaluate B(e) for each element in set S
      – no a priori order on iterations
      – set S may get new elements during execution
    • for each e in OrderedSet S do B(e)
      – evaluate B(e) for each element in set S
      – perform iterations in order specified by OrderedSet
      – set S may get new elements during execution
• Stephanie programmers
  – Galois concurrent data structure library
• (Wirth) Algorithms + Data structures = Programs

DMR using Galois iterators:

Mesh m = /* read in mesh */
Set ws;
ws.add(m.badTriangles()); // initialize ws
for each tr in Set ws do { // unordered Set iterator
  if (tr no longer in mesh) continue;
  Cavity c = new Cavity(tr);
  c.expand();
  c.retriangulate();
  m.update(c);
  ws.add(c.badTriangles()); // bad triangles
}
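The sequential semantics of the two set iterators can be sketched as follows. This Python is a hedged illustration only: `for_each`, `for_each_ordered` and the interval-splitting operator are hypothetical names, and the body returns newly created work instead of calling `ws.add` as in the slide.

```python
import heapq


def for_each(workset, body):
    """Unordered set iterator: any order is allowed; body may add new elements."""
    pending = list(workset)
    while pending:
        element = pending.pop()          # arbitrary choice, here LIFO
        pending.extend(body(element))


def for_each_ordered(workset, body):
    """Ordered set iterator: always run the smallest pending element next."""
    pending = list(workset)
    heapq.heapify(pending)
    while pending:
        element = heapq.heappop(pending)
        for new in body(element):
            heapq.heappush(pending, new)


# Toy operator (hypothetical): split any interval longer than 1 into halves.
done = []


def split(interval):
    lo, hi = interval
    if hi - lo <= 1:
        done.append(interval)
        return []
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]


for_each([(0, 4)], split)
unordered_result = sorted(done)

done.clear()
for_each_ordered([(0, 4)], split)
ordered_result = list(done)
```

Both runs end with the same four unit intervals. Only the ordered iterator guarantees the order in which they are produced, which mirrors the don't-care non-determinism of the unordered case.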

Galois parallel execution model

• Parallel execution model:
  – shared-memory
  – optimistic execution of Galois iterators
• Implementation:
  – master thread begins execution of program
  – when it encounters iterator, worker threads help by executing iterations concurrently
  – several work-distribution strategies implemented in Galois (cf. OpenMP)
  – barrier synchronization at end of iterator
• Independence of neighborhoods:
  – software TM variety
  – logical locks on nodes and edges
• Ordering constraints for ordered set iterator:
  – execute iterations out of order but commit in order
  – cf. out-of-order CPUs

[Figure: master thread running the Joe program (main() containing a for-each iterator) against a concurrent data structure with active nodes i1 to i5]
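Optimistic execution with logical locks can be simulated without real threads. The Python sketch below runs activities in rounds: in each round up to `workers` activities try to lock their neighborhoods, and any activity that hits a held lock aborts and retries later. The round structure, function name and data are assumptions for illustration; the actual Galois runtime is concurrent and considerably more involved.

```python
def run_optimistically(activities, workers):
    """Simulate rounds of speculative execution with logical locks on elements.

    activities maps an activity id to its neighborhood (a set of elements).
    Returns the list of rounds, each a list of activity ids that committed.
    """
    pending = sorted(activities)
    rounds = []
    while pending:
        locks = {}                       # element -> activity holding its lock
        committed, retry = [], []
        for act in pending[:workers]:
            hood = activities[act]
            if any(e in locks for e in hood):
                retry.append(act)        # conflict: abort and try again later
            else:
                for e in hood:
                    locks[e] = act
                committed.append(act)
        rounds.append(committed)
        pending = retry + pending[workers:]
    return rounds


# Hypothetical neighborhoods for five active nodes.
activities = {"i1": {1, 2}, "i2": {2, 3}, "i3": {4}, "i4": {3, 5}, "i5": {5, 6}}
rounds = run_optimistically(activities, workers=4)
```

The first activity in every round always commits because no locks are held yet, so the simulation makes progress even when all neighborhoods overlap.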

ParaMeter tool (PPoPP 2009)

• Idealized execution model:
  – unbounded number of processors
  – applying operator at active node takes one time step
  – execute a maximal set of active nodes, subject to neighborhood and ordering constraints
• Measures amorphous data-parallelism in irregular program execution
• Useful as an analysis tool
• Milind will talk about this later

Example: DMR

• Input mesh:
  – Produced by Triangle (Shewchuk)
  – 550K triangles
  – Roughly half are badly shaped
• Available parallelism:
  – How many non-conflicting triangles can be expanded at each time step?
• Parallelism intensity:
  – What fraction of the total number of bad triangles can be expanded at each step?
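Both metrics can be computed for any unordered algorithm once neighborhoods are known. The following Python sketch mimics the idealized model (unbounded processors, one step per activity) on a toy problem; it is not the ParaMeter tool, and the greedy selection, the integer activities and the `operator` that creates new work are all assumptions.

```python
def parallelism_profile(initial, neighborhood, operator):
    """Per step: (activities executed, fraction of pending activities executed)."""
    pending = list(initial)
    profile = []
    while pending:
        touched, executed, deferred = set(), [], []
        for act in pending:              # greedy maximal conflict-free set
            hood = neighborhood(act)
            if touched & hood:
                deferred.append(act)
            else:
                touched |= hood
                executed.append(act)
        created = [new for act in executed for new in operator(act)]
        profile.append((len(executed), len(executed) / len(pending)))
        pending = deferred + created
    return profile


# Toy problem: activity k touches elements k and k+1; small k creates new work.
profile = parallelism_profile(
    initial=[0, 1, 2, 3, 4, 5],
    neighborhood=lambda k: {k, k + 1},
    operator=lambda k: [k + 10] if k < 2 else [],
)
```

The first number in each pair plays the role of available parallelism and the second the role of parallelism intensity for that step.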

Summary of ADP abstraction

• Old abstraction: dependence graphs
  – inadequate for most irregular algorithms
• New abstraction: operator formulation
  – active elements
  – neighborhoods
  – ordering of active elements
• Baseline execution model
  – Galois programming model
    • sequential, OO
    • uses abstractions in model
  – optimistic parallel execution
  – conflict detection scheme: many choices
    • exclusive locks
    • read/write locks
    • commutativity relations
• Amorphous data-parallelism
  – generalizes conventional data-parallelism


Structure in irregular algorithms

• Baseline implementation is general but usually inefficient
  – (eg) dynamic scheduling of iterations is not needed for stencil codes since grid structure is known at compile-time
  – (eg) hand-written parallel implementations of DMR and stencil codes do not buffer updates to neighborhood until commit point
• Efficient execution requires exploiting structure in algorithms and data structures
• How do we talk about structure in algorithms?
  – Previous approaches: like descriptive biology
    • Mattson et al book
    • Parallel programming patterns (PPP): Snir et al
    • Berkeley motifs: Patterson, Yelick, et al
    • …
  – Our approach: like molecular biology
    • structural analysis of algorithms
    • based on amorphous data-parallelism framework

Structural analysis of irregular algorithms

irregular algorithms
  topology
    general graph
    grid
    tree
  operator
    morph
      refinement
      coarsening
      general
    local computation
      topology-driven
      data-driven
    reader
  ordering
    unordered
    ordered

• Jacobi: topology: grid, operator: local computation, ordering: unordered
• DMR, graph reduction: topology: graph, operator: morph, ordering: unordered
• Event-driven simulation: topology: graph, operator: local computation, ordering: ordered


Exploiting structure to reduce overheads

Cautious operators (PPoPP 2010)

• Cautious operator implementation:
  – reads all the elements in its neighborhood before modifying any of them
  – (eg) Delaunay mesh refinement
• Algorithm structure:
  – cautious operator + unordered active elements
• Optimization: optimistic execution w/o buffering
  – grab locks on elements during read phase
    • conflict: someone else has lock, so release your locks
  – once update phase begins, no new locks will be acquired
    • update in-place w/o making copies
    • zero-buffering
  – note: this is not two-phase locking
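The read-then-update discipline of a cautious operator can be sketched with a small lock table. The Python below is single-threaded and purely illustrative: `LockTable`, `Conflict` and `run_cautious` are hypothetical helpers, not Galois classes, and conflicts are produced by leaving one activity's locks in place.

```python
class Conflict(Exception):
    pass


class LockTable:
    """Logical locks on graph elements (hypothetical helper, not the Galois API)."""

    def __init__(self):
        self.owner = {}

    def acquire(self, element, activity):
        holder = self.owner.setdefault(element, activity)
        if holder != activity:
            raise Conflict(element)

    def release_all(self, activity):
        for element in [e for e, a in self.owner.items() if a == activity]:
            del self.owner[element]


def run_cautious(activity, neighborhood, update, data, locks):
    """Read phase takes every lock; only then does the in-place update phase start."""
    try:
        for element in neighborhood:         # read phase: may still abort cheaply
            locks.acquire(element, activity)
    except Conflict:
        locks.release_all(activity)          # nothing was written, so no undo log
        return False
    for element in neighborhood:             # update phase: no new locks needed
        data[element] = update(data[element])
    return True


data = {"a": 1, "b": 2, "c": 3}
locks = LockTable()
first = run_cautious("act1", ["a", "b"], lambda x: x * 10, data, locks)
second = run_cautious("act2", ["b", "c"], lambda x: x + 1, data, locks)  # b is held
locks.release_all("act1")                                                # act1 commits
third = run_cautious("act2", ["b", "c"], lambda x: x + 1, data, locks)   # retry
```

Because an abort can only happen before the first write, the failed attempt leaves `data` untouched and needs no buffered copies to restore.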

Graph partitioning (ASPLOS 2008)

• Algorithm structure:
  – general graph/grid + unordered active elements
• Optimization I:
  – partition the graph/grid and work-set between cores
  – data-centric work assignment: core gets active elements from its own partition
• Pros and cons:
  – eliminates contention for worklist
  – improves locality and can dramatically reduce conflicts
  – dynamic load-balancing may be needed
• Optimization II:
  – lock coarsening: associate logical locks with partitions, not graph elements
  – reduces overhead of lock management
• Over-decomposition may improve core utilization

[Figure: graph partitions mapped to cores]
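Both optimizations can be illustrated on a grid with block partitions. The Python sketch below assumes an n x n grid split into square blocks and a 5-point neighborhood; the function names and the blocking scheme are assumptions, and a general graph would need a real partitioner instead.

```python
def partition_of(node, n, parts_per_side):
    """Map grid node (row, col) of an n x n grid to a block partition id."""
    block = n // parts_per_side
    row, col = node
    return (row // block) * parts_per_side + (col // block)


def assign_work(active_nodes, n, parts_per_side):
    """Data-centric work assignment: each partition gets its own worklist."""
    worklists = {p: [] for p in range(parts_per_side ** 2)}
    for node in active_nodes:
        worklists[partition_of(node, n, parts_per_side)].append(node)
    return worklists


def partitions_to_lock(node, n, parts_per_side):
    """Lock coarsening: lock the partitions touched by the 5-point neighborhood."""
    row, col = node
    hood = [(row, col), (row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return {partition_of(p, n, parts_per_side)
            for p in hood if 0 <= p[0] < n and 0 <= p[1] < n}


worklists = assign_work([(0, 0), (1, 5), (4, 4), (7, 7), (3, 3)], n=8, parts_per_side=2)
interior_locks = partitions_to_lock((1, 1), n=8, parts_per_side=2)
boundary_locks = partitions_to_lock((3, 3), n=8, parts_per_side=2)
```

A node in the interior of a partition needs a single coarse lock, while a node on a partition boundary needs several. Conflicts are therefore confined to activities near partition boundaries.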

Eliminating speculation

• Coordinated execution of activities: if we can build dependence graph
  – Run-time scheduling:
    • cautious operator + unordered active elements
    • execute all activities partially to determine neighborhoods
    • create interference graph and find independent set of activities
    • execute independent set of activities in parallel w/o synchronization
  – Just-in-time scheduling:
    • local computation + topology-driven (eg) tree walks, sparse MVM
    • inspector-executor approach
  – Compile-time scheduling:
    • previous case + graph is known at compile-time (eg) Jacobi, FFT, MMM, Cholesky
    • make all scheduling decisions at compile-time

DMR Results

• 0.5M triangles, 0.25M bad triangles
• Best serial: locallifo.flagopt
• Serial time: 17002 ms
• Best parallel time: 3745 ms
• Best speedup: 4.5X
• 2 x Xeon X5570 (4 core, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20GB heap size
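The reported speedup follows directly from the two times. The parallel efficiency shown next to it is only a rough reading: it assumes the best parallel run used all 2 x 4 = 8 physical cores, which the slide does not state.

```latex
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}}
  = \frac{17002\ \text{ms}}{3745\ \text{ms}} \approx 4.54,
\qquad
E = \frac{S}{p} \approx \frac{4.54}{8} \approx 0.57
```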

Mario and Donald will talk about Galois programming model and optimizations

FAQs

• How is this different from TM?
  – Programming model:
    • TM: explicitly parallel (threads)
      – transactions: synchronization mechanism for threads
      – mostly memory-level conflict detection + boosting
    • Galois: Joe programs are sequential OO programs
      – ADT-level conflict detection
  – Where do threads come from?
    • TM: someone else’s problem
    • Galois project: focus on
      – sources of parallelism in algorithms
      – structure of parallelism in algorithms
  – (Wirth) Algorithms + Data structures = Programs
    • Galois: focus on algorithms and algorithmic structure
    • TM: focus on concurrent data structures
  – Galois: mechanisms customized to algorithm structure
    • Few algorithms require full-blown speculation/optimistic parallelization

FAQs (contd.)

• How is this different from TLS?
  – Programming model:
    • Galois: separation between ADT and its implementation is critical
      – permits separation of Joe and Stephanie layers (cf. relational databases)
      – permits more aggressive conflict detection schemes like commutativity relations
    • TLS: FORTRAN/C, so no separation between ADT and implementation
  – Programming model:
    • Galois: don’t-care non-determinism plays a central role
    • TLS: FORTRAN/C, so only ordered algorithms
  – Exploiting algorithmic structure:
    • Galois: exploits algorithmic structure to
      – reduce speculative overheads
      – focus speculation only where needed

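Summary

• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – …
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure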


Research ideas

• Application studies:
  – Lonestar benchmark suite for irregular programs: http://iss.ices.utexas.edu/lonestar/
• Development of ADP model
  – locality
  – intra-operator parallelism
• Identifying other algorithmic structure useful in parallel implementations
  – (eg) monotonically growing graphs (see Mario’s paper in OOPSLA 2010)
• Language/compiler analysis
  – how do you find/communicate structure to the implementation?
  – (eg) use program analysis to find cautious operators
• Specializing data structure implementations to particular algorithms
  – can this be done semi-automatically?
  – programmer writes sequential implementation of ADT and tool generates thread-safe version for use in Galois program?


Research ideas (contd.)

• Auto-tuning
  – concrete implementations for graphs
  – conflict detection strategies
  – scheduling strategies
  – contention-management policies
  – …
• Load-balancing
  – need “data-centric work-stealing”
  – move partitions, not active nodes
  – see Milind’s paper in SPAA 2010
• Understanding performance
• Architecture
  – existing core designs use lots of real estate for caches
  – may not be well-suited for irregular applications:
    • little spatial and temporal locality
    • relatively small amount of computation per memory access
  – one idea:
    • use multi-threading to hide latency
    • NVIDIA GPU-like architecture for irregular algorithms?
• Hardware-software codesign
  – can we generate custom cores for particular applications to exploit ADP?

Science of Parallel Programming

• Seemingly unrelated algorithms
• Unifying abstractions
• Specialized models that exploit structure
