
Page 1: On Improving the Execution of Distributed CnC Programs

On Improving the Execution of Distributed CnC Programs

Yuhan Peng (1), Martin Kong (1), Louis-Noel Pouchet (2), Vivek Sarkar (1)

(1) Department of Computer Science, Rice University
(2) Department of Computer Science, Colorado State University

CNC-2016 Workshop, September 2016

1 / 34

Page 2: On Improving the Execution of Distributed CnC Programs

Concurrent Collections (CnC)

- Runtime and data-flow model for parallel programming.
- No direct specification of parallel operations.
  - The user specifies the semantics with data and control dependencies.
  - The runtime decides the schedule of parallel tasks.
- Applicable to both shared and distributed memory.
- Can reach performance comparable to OpenMP/MPI applications.

2 / 34
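The execution model on this slide can be illustrated with a minimal dependency-counting scheduler sketch in Python. This is purely illustrative, not the Intel CnC API: the user only declares tasks and their data dependencies, and the "runtime" picks the execution order.

```python
from collections import defaultdict, deque

def run_dataflow(tasks, deps):
    """Execute each task once all of its declared inputs are satisfied.
    tasks: dict name -> zero-argument callable
    deps:  dict name -> list of task names it depends on"""
    remaining = {t: len(deps.get(t, [])) for t in tasks}
    dependents = defaultdict(list)
    for t, ins in deps.items():
        for d in ins:
            dependents[d].append(t)
    ready = deque(t for t, n in remaining.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()  # the scheduler, not the user, picks the order
        tasks[t]()
        order.append(t)
        for u in dependents[t]:
            remaining[u] -= 1
            if remaining[u] == 0:
                ready.append(u)
    return order

# The user states only dependencies; parallel structure is implicit.
results = {}
order = run_dataflow(
    {"a": lambda: results.setdefault("a", 1),
     "b": lambda: results.setdefault("b", 2),
     "c": lambda: results.setdefault("c", results["a"] + results["b"])},
    {"c": ["a", "b"]},
)
```

Here "c" runs only after both of its producers, so the schedule of "a" and "b" is free for the runtime to choose — the same freedom a CnC scheduler exploits.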

Page 3: On Improving the Execution of Distributed CnC Programs

Data-Flow Graph Language (DFGL)

- Intermediate graph representation for macro-dataflow programs.
- Emphasizes the data dependencies between tasks.
- User-friendly, expressive language.
- DFGL provides great opportunities for performing high-level optimizations:
  - Optimizations can be done through graph and loop transformations.
  - Especially well suited to polyhedral optimizations when the program exhibits regularity.

3 / 34

Page 4: On Improving the Execution of Distributed CnC Programs

Data-Flow Graph Language (DFGL)

- Automatic code generation tools can transform DFGL into executable CnC code.
- The DFGL framework. [1]

[1] Sbirlea, Alina, Louis-Noel Pouchet, and Vivek Sarkar. "DFGR: An Intermediate Graph Representation for Macro-Dataflow Programs." Data-Flow Execution Models for Extreme Scale Computing (DFM), Fourth Workshop on. IEEE, 2014.

4 / 34

Page 5: On Improving the Execution of Distributed CnC Programs

PIPES

- Programming language and compiler derived from DFGL.
  - Input: DFGL with producer and consumer relations, plus other language abstractions.
  - Output: a compilable Intel CnC C++ program.
- Concentrates on virtual topologies and task mappings.
- Automatically applies optimization transformations such as task coarsening and coalescing.
- Goal: better support for task-based programming on shared and distributed memory.

5 / 34

Page 6: On Improving the Execution of Distributed CnC Programs

PIPES

- The PIPES framework. [2]
- Good support for adding new optimization passes to the PIPES core.

[2] M. Kong, L-N. Pouchet, P. Sadayappan, and V. Sarkar. "PIPES: A Language and Compiler for Distributed Memory Task Parallelism." SC '16. IEEE, 2016.

6 / 34

Page 7: On Improving the Execution of Distributed CnC Programs

Motivation

- Managing and controlling the runtime overhead is crucial.
- In practice, such overhead depends on:
  - The total number of tasks created.
  - The number of tasks in flight at a given point in time.
  - The total number of input dependencies.

7 / 34

Page 8: On Improving the Execution of Distributed CnC Programs

Motivation

- Johnson 3D matrix multiply algorithm: our motivating example.
  - Introduced by Ramesh C. Agarwal et al. in 1995.
  - A parallelizable divide-and-conquer approach.
- To compute the product A * B, Johnson 3D goes through two steps:
  - MMC: divide A and B into small matrix blocks, and multiply the blocks in parallel.
  - MMR: a reduction that sums up the results of the block multiplications.

8 / 34
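The two steps above can be sketched sequentially in Python. The names MMC and MMR follow the slides; treating each "tile" as a scalar is an illustrative simplification, since in the real algorithm each entry is a tile multiplication/addition over an n x n x n grid.

```python
def johnson3d(A, B, n):
    """Johnson 3D over an n x n grid of tiles (tiles shown as scalars).
    MMC: every (i, j, k) tile product is independent and can run in parallel.
    MMR: reduce the partial products over the k dimension."""
    # MMC step: independent tile multiplications
    partial = {(i, j, k): A[i][k] * B[k][j]
               for i in range(n) for j in range(n) for k in range(n)}
    # MMR step: sum the n partial results for each output tile (i, j)
    return [[sum(partial[i, j, k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# 2x2 example: matches the ordinary matrix product
C = johnson3d([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2)
```

The point of the decomposition is that all n^3 MMC instances are mutually independent, while MMR introduces the only cross-task dependencies.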

Page 9: On Improving the Execution of Distributed CnC Programs

Motivation

- We start by dissecting the program execution time of the Johnson 3D algorithm.
  - We tested the algorithm across different tile sizes.
  - We tested the algorithm across different numbers of nodes and task mappings.
- The program overhead is a non-negligible portion of the total time.
  - In the distributed Johnson 3D algorithm, the overhead can take between 2% and 50% of the total execution time.

9 / 34

Page 10: On Improving the Execution of Distributed CnC Programs

Motivation

- The overhead of executing Johnson 3D on different numbers of processors.
- We use 1-8 nodes, where each node has 12 processors.
- More processors, larger overhead proportion.
- The overhead grows superlinearly.

10 / 34

Page 11: On Improving the Execution of Distributed CnC Programs

Our Approach

- Our goal: minimize the runtime overhead.
- We propose two transformation techniques:
  - Dependency reduction.
  - Dynamic prescription.

11 / 34

Page 12: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- Objective: minimize the number of tasks and/or dependencies.
  - Avoids needless polling of satisfied dependencies.
  - Depends on the runtime scheduler.
- Improves the program's progress:
  - Reduces the critical path length.
  - Minimizes the number of task instances and block instances.
  - The processors have fewer tasks to handle, and fewer dependencies to query.

12 / 34

Page 13: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- The user may specify a reduction factor R.
- We then transform the dataflow graph so that:
  - The semantics of the input DFGL do not change.
  - The total number of dependencies is minimized.
  - One dimension is essentially contracted by a factor of R.
- If R = 1, nothing changes.
- If R = 2, every two instances are fused, i.e. N / 2 instances remain.
- If R = N, all instances are fused, i.e. the dimension collapses.

13 / 34
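The effect of the reduction factor R on instance counts can be modeled with a small Python sketch. This is an illustrative model of the contraction, not PIPES output: fusing a dimension of N instances in groups of R leaves ceil(N/R) fused instances.

```python
def fuse_dimension(n, r):
    """Group n task instances along one dimension into chunks of size r.
    Each chunk becomes a single fused task instance."""
    return [list(range(start, min(start + r, n)))
            for start in range(0, n, r)]

N = 8
assert len(fuse_dimension(N, 1)) == N        # R = 1: no change
assert len(fuse_dimension(N, 2)) == N // 2   # R = 2: every two instances fused
assert len(fuse_dimension(N, N)) == 1        # R = N: the dimension collapses
```

Fewer instances means fewer dependencies for the scheduler to track, which is exactly the overhead this transformation targets.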

Page 14: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- The dependency diagram of the original Johnson 3D algorithm.

14 / 34

Page 15: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- The dependency diagram of the Johnson 3D algorithm after dependency reduction with R = 2.

15 / 34

Page 16: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- The dependency diagram of the Johnson 3D algorithm after dependency reduction with R = N.

16 / 34

Page 17: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- Related work: OpenMP chunk size.
  - Merges multiple loop bodies into one serial task before allocation to a thread.
  - Increases the work's granularity.
  - Improves the program's scalability.
- The OpenMP chunk size is similar to the reduction factor R.
- Differences between the OpenMP chunk size and PIPES dependency reduction:
  - OpenMP only performs data parallelism.
  - PIPES dependency reduction can support task parallelism.
  - PIPES dependency reduction supports the case R = N, i.e. collapsing the entire dimension, whereas the OpenMP chunk size must be a constant.
  - PIPES dependency reduction also removes intermediate results.

17 / 34

Page 18: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Minimizes the number of tasks in flight by enforcing a dynamic prescription schedule, also known as a creation and spawning schedule.
  - Determines when tasks are created and spawned.
  - Minimizes the number of waiting tasks.
- Narrows down the runtime's scheduling options.
- Potentially improves the program's locality.
- Similarly, the user may specify a prescription factor K:
  - The size of the task set of each spawn.

18 / 34

Page 19: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Using the MMC step in Johnson 3D as an example.
- Original version (K = N):
  - env::MMC(i,j,k)  0 ≤ i, j, k ≤ n

19 / 34

Page 20: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Dynamic prescription (K = 1):
  - env::MMC(i,j,0)  0 ≤ i, j ≤ n
  - MMC(i,j,k)::MMC(i,j,k+1)  0 ≤ i, j ≤ n, 0 ≤ k ≤ n-1

20 / 34

Page 21: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Dynamic prescription (K = 2):
  - env::MMC(i,j,0), MMC(i,j,1)  0 ≤ i, j ≤ n
  - MMC(i,j,k)::MMC(i,j,k+1), MMC(i,j,k+2)  0 ≤ i, j ≤ n, 0 ≤ k ≤ n-2, k mod 2 = 1

21 / 34

Page 22: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Dynamic prescription (K = 4):
  - env::MMC(i,j,0), MMC(i,j,1), MMC(i,j,2), MMC(i,j,3)  0 ≤ i, j ≤ n
  - MMC(i,j,k)::MMC(i,j,k+1), MMC(i,j,k+2), MMC(i,j,k+3), MMC(i,j,k+4)  0 ≤ i, j ≤ n, 0 ≤ k ≤ n-4, k mod 4 = 3

22 / 34
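The prescription rules on the last three slides form a chain along the k dimension: the environment spawns the first K instances of each (i, j) chain, and the last instance of each group spawns the next K. A Python sketch of that spawning structure for one (i, j) chain (illustrative only, not DFGL or CnC code):

```python
def prescribers(n, k):
    """Model the slides' prescription rules for one (i, j) chain of MMC.
    env spawns instances 0 .. K-1; the instance with k-index kk such that
    kk mod K == K-1 spawns the next K instances (if any remain)."""
    env = list(range(min(k, n)))
    spawned = {}
    for kk in range(n):
        if kk % k == k - 1:  # last instance of its group prescribes the next group
            spawned[kk] = [kk + d for d in range(1, k + 1) if kk + d < n]
    return env, spawned

# n = 8, K = 2: env spawns {0, 1}; instance 1 spawns {2, 3}; instance 3
# spawns {4, 5}; and so on, so at most 2 instances per chain wait at once.
env, spawned = prescribers(8, 2)
```

Larger K means more instances spawned per step (more in flight, fewer prescription edges); K = N recovers the original all-at-once spawning.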

Page 23: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Related work: cilk_for.
  - cilk_for divides the loop into chunks.
  - Grain size: the maximum number of iterations in each chunk.
  - #pragma cilk grainsize = expression
- The grain size is similar to the prescription factor K.
- Differences between cilk_for and PIPES dynamic prescription:
  - cilk_for only performs data parallelism.
  - PIPES dynamic prescription can support task parallelism.
  - PIPES dynamic prescription can support prescription between different kernels.

23 / 34

Page 24: On Improving the Execution of Distributed CnC Programs

Complexity Analysis

- No dependency reduction on MMR.
- Dynamic prescription on MMC (K = 1, 2, 4).

                  Original   K = 1        K = 2         K = 4
  env::MMC        N^3        N^2          2N^2          4N^2
  env::MMR        N^3        N^3          N^3           N^3
  MMC::MMC        0          N^3 - N^2    N^3 - 2N^2    N^3 - 4N^2
  MMC::MMR        0          0            0             0
  MMR::MMR        0          0            0             0
  Theoretical CPL N + 1      N + 1        N + 1         N + 1

24 / 34
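The MMC rows of the table can be checked by a small counting sketch, assuming (as the table suggests) an N x N x N iteration space for MMC with each (i, j) chain spawned in groups of K along the k dimension:

```python
def prescription_counts(n, k):
    """Count env::MMC and MMC::MMC prescription edges for an n^3 MMC space
    when each of the n^2 (i, j) chains is spawned in groups of k."""
    env_mmc = n * n * min(k, n)        # env spawns the first K of each chain
    mmc_mmc = n * n * (n - min(k, n))  # the rest are spawned by earlier MMCs
    return env_mmc, mmc_mmc

N = 10
for K in (1, 2, 4):
    env_mmc, mmc_mmc = prescription_counts(N, K)
    # matches the table: env::MMC = K*N^2, MMC::MMC = N^3 - K*N^2
    assert env_mmc == K * N**2
    assert mmc_mmc == N**3 - K * N**2
    assert env_mmc + mmc_mmc == N**3   # every MMC instance is spawned once
```

The total number of prescription edges is N^3 in every configuration; dynamic prescription only shifts who issues them and when, which is what bounds the tasks in flight.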

Page 25: On Improving the Execution of Distributed CnC Programs

Complexity Analysis

- Dependency reduction on MMR (R = N).
- Dynamic prescription on MMC (K = 1, 2, 4).

                  Original   K = 1        K = 2         K = 4
  env::MMC        N^3        N^2          2N^2          4N^2
  env::MMR        N^2        N^2          N^2           N^2
  MMC::MMC        0          N^3 - N^2    N^3 - 2N^2    N^3 - 4N^2
  MMC::MMR        0          0            0             0
  MMR::MMR        0          0            0             0
  Theoretical CPL 2          2            2             2

25 / 34

Page 26: On Improving the Execution of Distributed CnC Programs

Experimental Setup

- All experiments were performed on the DAVinCI cluster at Rice University.
- The following table shows the detailed configuration.

  Parameters                 Value
  Nodes                      1-8
  Processor                  Intel Xeon X5660 @ 2.80 GHz
  Sockets per node           2
  Cores per socket           6
  InfiniBand QDR bandwidth   40 Gb/s
  L1 cache                   32 KB per core
  L2 cache                   256 KB per core
  L3 cache                   12 MB per socket
  CnC                        1.01
  MPI runtime                Intel MPI 5.0
  Compiler                   Intel ICPC 13
  Slurm                      2.6.5

26 / 34

Page 27: On Improving the Execution of Distributed CnC Programs

Experimental Setup

- Matrix size: 8000 * 8000.
- 1, 2, 4, 8 nodes * 12 processors per node.
- Tile sizes: 400, 500, 800, 1000, 1600, 2000.
- All transformations were manually implemented.
- CnC tuners were used:
  - Dependency consumer.
  - Computed on.
  - Consumed on.

27 / 34

Page 28: On Improving the Execution of Distributed CnC Programs

Performance Result

- We applied our proposed transformations to the Johnson 3D algorithm:
  - Dependency reduction on MMR (R = N).
  - Dynamic prescription on MMC (K = 1, 2, 4).
  - Dynamic prescription on both MMC and MMR (K = 1, 2, 4).
    - Adds MMC(i,j,k)::MMR(i,j,k)  0 ≤ i, j, k ≤ n.
  - Dynamic reduction (K = 1, 2, 4).
  - Dependency reduction on MMR (R = N), plus dynamic prescription on MMC (K = 1, 2, 4).
- We obtained a 30% speedup when combining the proposed transformations, compared to the base version.

28 / 34

Page 29: On Improving the Execution of Distributed CnC Programs

Performance Result

- Dependency reduction on MMR (R = N).
- Dynamic prescription on MMC (K = 1, 2, 4).

29 / 34

Page 30: On Improving the Execution of Distributed CnC Programs

Performance Result

- Dynamic prescription on both MMC and MMR (K = 1, 2, 4).

30 / 34

Page 31: On Improving the Execution of Distributed CnC Programs

Performance Result

- Dynamic reduction (R = N, K = 1, 2, 4).

31 / 34

Page 32: On Improving the Execution of Distributed CnC Programs

Conclusion

- The overhead of task scheduling in distributed CnC programs is non-negligible.
- We proposed two transformations for overhead reduction:
  - Dependency reduction.
  - Dynamic prescription.
- Our preliminary results show a 30% speedup from applying the proposed transformations to the Johnson distributed matrix-multiply algorithm.

32 / 34

Page 33: On Improving the Execution of Distributed CnC Programs

Ongoing Work

- Currently we are focusing on dynamic prescription.
- Degree of freedom (dof): a property of task scheduling.
  - We have identified several dofs:
    - Manipulator: concentrate the prescription on as few tasks as possible.
    - Balanced: try to have more tasks in charge of prescription operations.
    - Phased: all tasks of A should finish before any task of B starts.
    - Interleaved: some tasks of A should finish before starting some tasks of B.
  - More dofs remain to be discovered.
- Policies: combinations of dofs.
  - Policies determine runtime behavior.
  - Policies are applicable program-wide, or to a subset of tasks.

33 / 34

Page 34: On Improving the Execution of Distributed CnC Programs

References

- R. Agarwal et al. "A Three-Dimensional Approach to Parallel Matrix Multiplication." IBM Journal of Research and Development, vol. 39, no. 5, pp. 575-582, Sept. 1995.
- Chandramowlishwaran, Knobe, and Vuduc. "Performance Evaluation of Concurrent Collections on High-Performance Multicore Computing Systems." IPDPS, 2010.
- Sbirlea, Alina, Louis-Noel Pouchet, and Vivek Sarkar. "DFGR: An Intermediate Graph Representation for Macro-Dataflow Programs." Data-Flow Execution Models for Extreme Scale Computing (DFM), Fourth Workshop on. IEEE, 2014.
- Sbirlea, Alina, et al. "Polyhedral Optimizations for a Data-Flow Graph Language." International Workshop on Languages and Compilers for Parallel Computing. Springer International Publishing, 2015.
- M. Kong, L-N. Pouchet, P. Sadayappan, and V. Sarkar. "PIPES: A Language and Compiler for Distributed Memory Task Parallelism." SC '16. IEEE, 2016.

34 / 34