
Page 1: On Improving the Execution of Distributed CnC Programs

On Improving the Execution of Distributed CnC Programs

Yuhan Peng (1), Martin Kong (1), Louis-Noel Pouchet (2), Vivek Sarkar (1)

(1) Department of Computer Science, Rice University
(2) Department of Computer Science, Colorado State University

CNC-2016 Workshop, September 2016

1 / 34

Page 2: On Improving the Execution of Distributed CnC Programs

Concurrent Collections (CnC)

- Runtime and data-flow model for parallel programming.
- No direct specification of parallel operations.
  - The user specifies the semantics with data and control dependencies.
  - The runtime decides the schedule of parallel tasks.
- Applicable to both shared and distributed memory.
- Can reach performance comparable to OpenMP/MPI applications.

2 / 34
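The execution model on this slide can be illustrated with a minimal dependency-counting scheduler sketch in Python. This is purely illustrative, not the Intel CnC API: the user only declares tasks and their data dependencies, and the "runtime" picks the execution order.

```python
from collections import defaultdict, deque

def run_dataflow(tasks, deps):
    """Execute each task once all of its declared inputs are satisfied.
    tasks: dict name -> zero-argument callable
    deps:  dict name -> list of task names it depends on"""
    remaining = {t: len(deps.get(t, [])) for t in tasks}
    dependents = defaultdict(list)
    for t, ins in deps.items():
        for d in ins:
            dependents[d].append(t)
    ready = deque(t for t, n in remaining.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()  # the scheduler, not the user, picks the order
        tasks[t]()
        order.append(t)
        for u in dependents[t]:
            remaining[u] -= 1
            if remaining[u] == 0:
                ready.append(u)
    return order

# The user states only dependencies; parallel structure is implicit.
results = {}
order = run_dataflow(
    {"a": lambda: results.setdefault("a", 1),
     "b": lambda: results.setdefault("b", 2),
     "c": lambda: results.setdefault("c", results["a"] + results["b"])},
    {"c": ["a", "b"]},
)
```

Here "c" runs only after both of its producers, so the schedule of "a" and "b" is free for the runtime to choose — the same freedom a CnC scheduler exploits.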

Page 3: On Improving the Execution of Distributed CnC Programs

Data-Flow Graph Language (DFGL)

- Intermediate graph representation for macro-dataflow programs.
- Emphasizes the data dependencies between tasks.
- User-friendly, expressive language.
- DFGL provides great opportunities for performing high-level optimizations:
  - Optimizations can be done through graph and loop transformations.
  - Especially well suited to polyhedral optimizations when the program exhibits regularity.

3 / 34

Page 4: On Improving the Execution of Distributed CnC Programs

Data-Flow Graph Language (DFGL)

- Automatic code generation tools can transform DFGL into executable CnC code.
- The DFGL framework. [1]

[1] Sbirlea, Alina, Louis-Noel Pouchet, and Vivek Sarkar. "DFGR: An Intermediate Graph Representation for Macro-Dataflow Programs." Data-Flow Execution Models for Extreme Scale Computing (DFM), Fourth Workshop on. IEEE, 2014.

4 / 34

Page 5: On Improving the Execution of Distributed CnC Programs

PIPES

- Programming language and compiler derived from DFGL.
  - Input: DFGL with producer and consumer relations, plus other language abstractions.
  - Output: a compilable Intel CnC C++ program.
- Concentrates on virtual topologies and task mappings.
- Automatically applies optimization transformations such as task coarsening and coalescing.
- Goal: better support for task-based programming on shared and distributed memory.

5 / 34

Page 6: On Improving the Execution of Distributed CnC Programs

PIPES

- The PIPES framework. [2]
- Good support for adding new optimization passes to the PIPES core.

[2] M. Kong, L-N. Pouchet, P. Sadayappan, and V. Sarkar. "PIPES: A Language and Compiler for Distributed Memory Task Parallelism." SC '16. IEEE, 2016.

6 / 34

Page 7: On Improving the Execution of Distributed CnC Programs

Motivation

- Managing and controlling the runtime overhead is crucial.
- In practice, such overhead depends on:
  - The total number of tasks created.
  - The number of tasks in flight at a given point in time.
  - The total number of input dependencies.

7 / 34

Page 8: On Improving the Execution of Distributed CnC Programs

Motivation

- Johnson 3D matrix multiply algorithm: our motivating example.
  - Introduced by Ramesh C. Agarwal et al. in 1995.
  - A parallelizable divide-and-conquer approach.
- To compute the product A * B, Johnson 3D goes through two steps:
  - MMC: divide A and B into small matrix blocks, and multiply the blocks in parallel.
  - MMR: a reduction that sums up the results of the block multiplications.

8 / 34
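The two steps above can be sketched sequentially in Python. The names MMC and MMR follow the slides; treating each "tile" as a scalar is an illustrative simplification, since in the real algorithm each entry is a tile multiplication/addition over an n x n x n grid.

```python
def johnson3d(A, B, n):
    """Johnson 3D over an n x n grid of tiles (tiles shown as scalars).
    MMC: every (i, j, k) tile product is independent and can run in parallel.
    MMR: reduce the partial products over the k dimension."""
    # MMC step: independent tile multiplications
    partial = {(i, j, k): A[i][k] * B[k][j]
               for i in range(n) for j in range(n) for k in range(n)}
    # MMR step: sum the n partial results for each output tile (i, j)
    return [[sum(partial[i, j, k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# 2x2 example: matches the ordinary matrix product
C = johnson3d([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2)
```

The point of the decomposition is that all n^3 MMC instances are mutually independent, while MMR introduces the only cross-task dependencies.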

Page 9: On Improving the Execution of Distributed CnC Programs

Motivation

- We start by dissecting the program execution time of the Johnson 3D algorithm.
  - We tested the algorithm across different tile sizes.
  - We tested the algorithm across different numbers of nodes and task mappings.
- The program overhead is a non-negligible portion of the total time.
  - In the distributed Johnson 3D algorithm, the overhead can take between 2% and 50% of the total execution time.

9 / 34

Page 10: On Improving the Execution of Distributed CnC Programs

Motivation

- The overhead of executing Johnson 3D on different numbers of processors.
- We use 1-8 nodes, where each node has 12 processors.
- More processors, larger overhead proportion.
- The overhead grows superlinearly.

10 / 34

Page 11: On Improving the Execution of Distributed CnC Programs

Our Approach

- Our goal: minimize the runtime overhead.
- We propose two transformation techniques:
  - Dependency reduction.
  - Dynamic prescription.

11 / 34

Page 12: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- Objective: minimize the number of tasks and/or dependencies.
  - Avoids needless polling of satisfied dependencies.
  - Depends on the runtime scheduler.
- Improves the program's progress:
  - Reduces the critical path length.
  - Minimizes the number of task instances and block instances.
  - The processors have fewer tasks to handle, and fewer dependencies to query.

12 / 34

Page 13: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- The user may specify a reduction factor R.
- We then transform the dataflow graph so that:
  - The semantics of the input DFGL do not change.
  - The total number of dependencies is minimized.
  - One dimension is essentially contracted by a factor of R.
- If R = 1, nothing changes.
- If R = 2, every two instances are fused, i.e. N / 2 instances remain.
- If R = N, all instances are fused, i.e. the dimension collapses.

13 / 34
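The effect of the reduction factor R on instance counts can be modeled with a small Python sketch. This is an illustrative model of the contraction, not PIPES output: fusing a dimension of N instances in groups of R leaves ceil(N/R) fused instances.

```python
def fuse_dimension(n, r):
    """Group n task instances along one dimension into chunks of size r.
    Each chunk becomes a single fused task instance."""
    return [list(range(start, min(start + r, n)))
            for start in range(0, n, r)]

N = 8
assert len(fuse_dimension(N, 1)) == N        # R = 1: no change
assert len(fuse_dimension(N, 2)) == N // 2   # R = 2: every two instances fused
assert len(fuse_dimension(N, N)) == 1        # R = N: the dimension collapses
```

Fewer instances means fewer dependencies for the scheduler to track, which is exactly the overhead this transformation targets.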

Page 14: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- The dependency diagram of the original Johnson 3D algorithm.

14 / 34

Page 15: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- The dependency diagram of the Johnson 3D algorithm after dependency reduction with R = 2.

15 / 34

Page 16: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- The dependency diagram of the Johnson 3D algorithm after dependency reduction with R = N.

16 / 34

Page 17: On Improving the Execution of Distributed CnC Programs

Dependency Reduction

- Related work: OpenMP chunk size.
  - Merges multiple loop bodies into one serial task before allocation to a thread.
  - Increases the work's granularity.
  - Improves the program's scalability.
- The OpenMP chunk size is similar to the reduction factor R.
- Differences between the OpenMP chunk size and PIPES dependency reduction:
  - OpenMP only performs data parallelism.
  - PIPES dependency reduction can support task parallelism.
  - PIPES dependency reduction supports the case R = N, i.e. collapsing the entire dimension, whereas the OpenMP chunk size must be a constant.
  - PIPES dependency reduction also removes intermediate results.

17 / 34

Page 18: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Minimizes the number of tasks in flight by enforcing a dynamic prescription schedule, also known as a creation and spawning schedule.
  - Determines when tasks are created and spawned.
  - Minimizes the number of waiting tasks.
- Narrows down the runtime's scheduling options.
- Potentially improves the program's locality.
- Similarly, the user may specify a prescription factor K:
  - The size of the task set of each spawn.

18 / 34

Page 19: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Using the MMC step in Johnson 3D as an example.
- Original version (K = N):
  - env::MMC(i,j,k)  0 ≤ i, j, k ≤ n

19 / 34

Page 20: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Dynamic prescription (K = 1):
  - env::MMC(i,j,0)  0 ≤ i, j ≤ n
  - MMC(i,j,k)::MMC(i,j,k+1)  0 ≤ i, j ≤ n, 0 ≤ k ≤ n-1

20 / 34

Page 21: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Dynamic prescription (K = 2):
  - env::MMC(i,j,0), MMC(i,j,1)  0 ≤ i, j ≤ n
  - MMC(i,j,k)::MMC(i,j,k+1), MMC(i,j,k+2)  0 ≤ i, j ≤ n, 0 ≤ k ≤ n-2, k mod 2 = 1

21 / 34

Page 22: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Dynamic prescription (K = 4):
  - env::MMC(i,j,0), MMC(i,j,1), MMC(i,j,2), MMC(i,j,3)  0 ≤ i, j ≤ n
  - MMC(i,j,k)::MMC(i,j,k+1), MMC(i,j,k+2), MMC(i,j,k+3), MMC(i,j,k+4)  0 ≤ i, j ≤ n, 0 ≤ k ≤ n-4, k mod 4 = 3

22 / 34
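The prescription rules on the last three slides form a chain along the k dimension: the environment spawns the first K instances of each (i, j) chain, and the last instance of each group spawns the next K. A Python sketch of that spawning structure for one (i, j) chain (illustrative only, not DFGL or CnC code):

```python
def prescribers(n, k):
    """Model the slides' prescription rules for one (i, j) chain of MMC.
    env spawns instances 0 .. K-1; the instance with k-index kk such that
    kk mod K == K-1 spawns the next K instances (if any remain)."""
    env = list(range(min(k, n)))
    spawned = {}
    for kk in range(n):
        if kk % k == k - 1:  # last instance of its group prescribes the next group
            spawned[kk] = [kk + d for d in range(1, k + 1) if kk + d < n]
    return env, spawned

# n = 8, K = 2: env spawns {0, 1}; instance 1 spawns {2, 3}; instance 3
# spawns {4, 5}; and so on, so at most 2 instances per chain wait at once.
env, spawned = prescribers(8, 2)
```

Larger K means more instances spawned per step (more in flight, fewer prescription edges); K = N recovers the original all-at-once spawning.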

Page 23: On Improving the Execution of Distributed CnC Programs

Dynamic Prescription

- Related work: cilk_for.
  - cilk_for divides the loop into chunks.
  - Grain size: the maximum number of iterations in each chunk.
  - #pragma cilk grainsize = expression
- The grain size is similar to the prescription factor K.
- Differences between cilk_for and PIPES dynamic prescription:
  - cilk_for only performs data parallelism.
  - PIPES dynamic prescription can support task parallelism.
  - PIPES dynamic prescription can support prescription between different kernels.

23 / 34

Page 24: On Improving the Execution of Distributed CnC Programs

Complexity Analysis

- No dependency reduction on MMR.
- Dynamic prescription on MMC (K = 1, 2, 4).

                  Original   K = 1        K = 2         K = 4
  env::MMC        N^3        N^2          2N^2          4N^2
  env::MMR        N^3        N^3          N^3           N^3
  MMC::MMC        0          N^3 - N^2    N^3 - 2N^2    N^3 - 4N^2
  MMC::MMR        0          0            0             0
  MMR::MMR        0          0            0             0
  Theoretical CPL N + 1      N + 1        N + 1         N + 1

24 / 34
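The MMC rows of the table can be checked by a small counting sketch, assuming (as the table suggests) an N x N x N iteration space for MMC with each (i, j) chain spawned in groups of K along the k dimension:

```python
def prescription_counts(n, k):
    """Count env::MMC and MMC::MMC prescription edges for an n^3 MMC space
    when each of the n^2 (i, j) chains is spawned in groups of k."""
    env_mmc = n * n * min(k, n)        # env spawns the first K of each chain
    mmc_mmc = n * n * (n - min(k, n))  # the rest are spawned by earlier MMCs
    return env_mmc, mmc_mmc

N = 10
for K in (1, 2, 4):
    env_mmc, mmc_mmc = prescription_counts(N, K)
    # matches the table: env::MMC = K*N^2, MMC::MMC = N^3 - K*N^2
    assert env_mmc == K * N**2
    assert mmc_mmc == N**3 - K * N**2
    assert env_mmc + mmc_mmc == N**3   # every MMC instance is spawned once
```

The total number of prescription edges is N^3 in every configuration; dynamic prescription only shifts who issues them and when, which is what bounds the tasks in flight.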

Page 25: On Improving the Execution of Distributed CnC Programs

Complexity Analysis

- Dependency reduction on MMR (R = N).
- Dynamic prescription on MMC (K = 1, 2, 4).

                  Original   K = 1        K = 2         K = 4
  env::MMC        N^3        N^2          2N^2          4N^2
  env::MMR        N^2        N^2          N^2           N^2
  MMC::MMC        0          N^3 - N^2    N^3 - 2N^2    N^3 - 4N^2
  MMC::MMR        0          0            0             0
  MMR::MMR        0          0            0             0
  Theoretical CPL 2          2            2             2

25 / 34

Page 26: On Improving the Execution of Distributed CnC Programs

Experimental Setup

- All experiments were performed on the DAVinCI cluster at Rice University.
- The following table shows the detailed configuration.

  Parameters                 Value
  Nodes                      1-8
  Processor                  Intel Xeon X5660 @ 2.80 GHz
  Sockets per node           2
  Cores per socket           6
  InfiniBand QDR bandwidth   40 Gb/s
  L1 cache                   32 KB per core
  L2 cache                   256 KB per core
  L3 cache                   12 MB per socket
  CnC                        1.01
  MPI runtime                Intel MPI 5.0
  Compiler                   Intel ICPC 13
  Slurm                      2.6.5

26 / 34

Page 27: On Improving the Execution of Distributed CnC Programs

Experimental Setup

- Matrix size: 8000 * 8000.
- 1, 2, 4, 8 nodes * 12 processors per node.
- Tile sizes: 400, 500, 800, 1000, 1600, 2000.
- All transformations were manually implemented.
- CnC tuners were used:
  - Dependency consumer.
  - Computed on.
  - Consumed on.

27 / 34

Page 28: On Improving the Execution of Distributed CnC Programs

Performance Result

- We applied our proposed transformations to the Johnson 3D algorithm:
  - Dependency reduction on MMR (R = N).
  - Dynamic prescription on MMC (K = 1, 2, 4).
  - Dynamic prescription on both MMC and MMR (K = 1, 2, 4).
    - Adds MMC(i,j,k)::MMR(i,j,k)  0 ≤ i, j, k ≤ n.
  - Dynamic reduction (K = 1, 2, 4).
  - Dependency reduction on MMR (R = N), plus dynamic prescription on MMC (K = 1, 2, 4).
- We obtained a 30% speedup when combining the proposed transformations, compared to the base version.

28 / 34

Page 29: On Improving the Execution of Distributed CnC Programs

Performance Result

- Dependency reduction on MMR (R = N).
- Dynamic prescription on MMC (K = 1, 2, 4).

29 / 34

Page 30: On Improving the Execution of Distributed CnC Programs

Performance Result

- Dynamic prescription on both MMC and MMR (K = 1, 2, 4).

30 / 34

Page 31: On Improving the Execution of Distributed CnC Programs

Performance Result

- Dynamic reduction (R = N, K = 1, 2, 4).

31 / 34

Page 32: On Improving the Execution of Distributed CnC Programs

Conclusion

- The overhead of task scheduling in distributed CnC programs is non-negligible.
- We proposed two transformations for overhead reduction:
  - Dependency reduction.
  - Dynamic prescription.
- Our preliminary results show a 30% speedup from applying the proposed transformations to the Johnson distributed matrix-multiply algorithm.

32 / 34

Page 33: On Improving the Execution of Distributed CnC Programs

Ongoing Work

- Currently we are focusing on dynamic prescription.
- Degree of freedom (dof): a property of task scheduling.
  - We have identified several dofs:
    - Manipulator: concentrate the prescription on as few tasks as possible.
    - Balanced: try to have more tasks in charge of prescription operations.
    - Phased: all tasks of A should finish before any task of B starts.
    - Interleaved: some tasks of A should finish before starting some tasks of B.
  - More dofs remain to be discovered.
- Policies: combinations of dofs.
  - Policies determine runtime behavior.
  - Policies are applicable program-wide, or to a subset of tasks.

33 / 34

Page 34: On Improving the Execution of Distributed CnC Programs

References

- R. Agarwal et al. "A Three-Dimensional Approach to Parallel Matrix Multiplication." IBM Journal of Research and Development, vol. 39, no. 5, pp. 575-582, Sept. 1995.
- Chandramowlishwaran, Knobe, and Vuduc. "Performance Evaluation of Concurrent Collections on High-Performance Multicore Computing Systems." IPDPS, 2010.
- Sbirlea, Alina, Louis-Noel Pouchet, and Vivek Sarkar. "DFGR: An Intermediate Graph Representation for Macro-Dataflow Programs." Data-Flow Execution Models for Extreme Scale Computing (DFM), Fourth Workshop on. IEEE, 2014.
- Sbirlea, Alina, et al. "Polyhedral Optimizations for a Data-Flow Graph Language." International Workshop on Languages and Compilers for Parallel Computing. Springer International Publishing, 2015.
- M. Kong, L-N. Pouchet, P. Sadayappan, and V. Sarkar. "PIPES: A Language and Compiler for Distributed Memory Task Parallelism." SC '16. IEEE, 2016.

34 / 34