
Distributed communication-aware load balancing with TreeMatch in Charm++

The 9th Scheduling for Large Scale Systems Workshop, Lyon, France

Emmanuel Jeannot, Guillaume Mercier, Francois Tessier

In collaboration with the Charm++ Team from the PPL (UIUC, IL): Esteban Meneses-Rojas, Gengbin Zheng, Sanjay Kale

July 1, 2014


Introduction

Scalable execution of parallel applications

Number of cores is increasing

But memory per core is decreasing

Applications will need to communicate even more than they do now

Our solution

Process placement should take into account process affinity

Here: load balancing in Charm++ considering:

CPU load

process affinity (or other communicating objects)

topology: network and intra-node


Charm++

Features

Parallel object-oriented programming language based on C++

Programs are decomposed into a number of cooperating message-driven objects called chares.

In general we have more chares than processing units

Chares are mapped to physical processors by an adaptive runtime system

Load balancers can be called to migrate chares

Charm++ is able to use MPI for inter-process communication


Process Placement

Why we should consider it

Many current and future parallel platforms have several levels of hierarchy

Application chares/processes do not exchange the same amount of data (affinity)

The process placement policy may have an impact on performance

Cache hierarchy, memory bus, high-performance network...

Figure: hierarchical platform (Switch, Cabinets, Nodes, Processors, Cores)


Problems

Given

The parallel machine topology

The application communication pattern

Map application processes to physical resources (cores) to reduce the communication costs (NP-complete)
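To make the objective concrete, here is a minimal Python sketch of a cost that such a mapping could try to reduce. It assumes a toy cost model that is not taken from the slides: the cost of a mapping is the sum, over all pairs of processes, of the exchanged volume multiplied by the number of tree levels separating their cores. The names `tree_distance`, `mapping_cost` and `arities` are hypothetical.

```python
from itertools import combinations

def tree_distance(a, b, arities):
    """Number of tree levels to climb before cores a and b share an ancestor.

    arities lists the number of children per level, from the root down,
    e.g. [2, 2] describes 2 sockets with 2 cores each (4 leaf cores).
    """
    levels = 0
    for arity in reversed(arities):   # climb from the leaves towards the root
        if a == b:
            break
        a //= arity
        b //= arity
        levels += 1
    return levels

def mapping_cost(comm, sigma, arities):
    """Sum of volume * distance over all process pairs (toy cost model)."""
    total = 0
    for i, j in combinations(range(len(comm)), 2):
        volume = comm[i][j] + comm[j][i]
        total += volume * tree_distance(sigma[i], sigma[j], arities)
    return total

# Processes 0 and 2 exchange a lot of data, as do processes 1 and 3.
comm = [[0, 1, 9, 1],
        [1, 0, 1, 9],
        [9, 1, 0, 1],
        [1, 9, 1, 0]]
arities = [2, 2]                 # 2 sockets x 2 cores (assumed topology)
identity = [0, 1, 2, 3]          # process i on core i
affinity_aware = [0, 2, 1, 3]    # puts 0 next to 2, and 1 next to 3

print(mapping_cost(comm, identity, arities))        # 84
print(mapping_cost(comm, affinity_aware, arities))  # 52
```

In this toy case the affinity-aware mapping keeps the heavy exchanges inside a socket, which is why its cost is lower. Finding the best mapping in general is the NP-complete part.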

Figure: communication matrix zeus16.map (axes: sender rank, receiver rank)


TreeMatch

The TreeMatch Algorithm

Algorithm and environment to compute process placement based on process affinities and the NUMA topology

Input:

The communication pattern of the application

Preliminary execution with a monitored MPI implementation for static placement

Dynamic recovery on iterative applications with Charm++

A model (tree) of the underlying architecture: Hwloc can provide this

Output:

A process permutation σ such that σ(i) is the core number to which process i must be bound
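The input and output described above can be illustrated with a small Python sketch. This is not the TreeMatch algorithm itself: it is a deliberately simplified greedy pairing, assuming a topology where two consecutive cores share a cache or socket. The function name `greedy_pair_placement` is hypothetical.

```python
def greedy_pair_placement(comm):
    """Return sigma, where sigma[i] is the core assigned to process i.

    Simplified illustration: processes are paired greedily by decreasing
    exchanged volume, and each pair receives two consecutive cores, which
    are assumed to be close in the topology tree.
    """
    n = len(comm)
    pairs = sorted(
        ((comm[i][j] + comm[j][i], i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    sigma = [None] * n
    next_core = 0
    for _volume, i, j in pairs:
        if sigma[i] is None and sigma[j] is None:
            sigma[i], sigma[j] = next_core, next_core + 1
            next_core += 2
    # With an odd number of processes one is left over: give it the last core.
    for i in range(n):
        if sigma[i] is None:
            sigma[i] = next_core
            next_core += 1
    return sigma

# Processes 0 and 2 exchange a lot of data, as do processes 1 and 3.
comm = [[0, 1, 9, 1],
        [1, 0, 1, 9],
        [9, 1, 0, 1],
        [1, 9, 1, 0]]
sigma = greedy_pair_placement(comm)
print(sigma)
```

The result is a permutation in which the heavily communicating processes land on neighbouring cores. A real placement algorithm would also handle deeper trees and larger group sizes.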

TreeMatch only works on tree topologies. How can a 3d torus be handled?


Network placement

libtopomap

T. Hoefler and M. Snir, "Generic Topology Mapping Strategies for Large-Scale Parallel Architectures", Proc. Int’l Conf. Supercomputing (ICS), pp. 75-84, 2011.

A library for mapping processes onto various network topologies

Used in TreeMatchLB to consider the Blue Waters 3d torus

Figure: 3d Torus and a Cray Gemini router
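A torus differs from a tree because every dimension wraps around, so the distance between two nodes is not given by a common ancestor. The Python sketch below computes a shortest-path hop count on a 3d torus. It is only an illustration of the distance notion: it does not use or reproduce the libtopomap API, and it ignores routing and link bandwidth details of a real interconnect. The name `torus_hops` is hypothetical.

```python
def torus_hops(a, b, dims):
    """Minimal hop count between coordinates a and b on a torus of size dims."""
    hops = 0
    for x, y, size in zip(a, b, dims):
        delta = abs(x - y)
        hops += min(delta, size - delta)   # the wrap-around link may be shorter
    return hops

dims = (4, 4, 4)
print(torus_hops((0, 0, 0), (3, 0, 0), dims))  # 1, thanks to the wrap-around
print(torus_hops((0, 0, 0), (2, 2, 2), dims))  # 6
```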


Load balancing

Principle

Iterative applications

The load balancer is called at regular intervals

Migrate chares in order to optimize several criteria

The Charm++ runtime system provides:

chares load

chares affinity

etc.

Constraints

Dealing with complex modern architectures

Taking into account communications between elements

Some other communication-aware load-balancing algorithms

[L. L. Pilla, et al. 2012] NUCOLB, shared memory machines

[L. L. Pilla, et al. 2012] HwTopoLB

Some "built-in" Charm++ load balancers: RefineCommLB, GreedyCommLB...
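As a point of comparison for what a load-only strategy does, here is a minimal Python sketch of a greedy assignment: the heaviest chare goes to the currently least loaded core. It is written in the spirit of greedy strategies in general and is not Charm++ code. It ignores communication entirely, which is exactly what communication-aware balancers try to improve on. The name `greedy_assign` and the sample loads are hypothetical.

```python
import heapq

def greedy_assign(loads, n_cores):
    """Assign each chare (heaviest first) to the currently least loaded core."""
    heap = [(0.0, core) for core in range(n_cores)]   # (accumulated load, core id)
    heapq.heapify(heap)
    placement = {}
    for chare in sorted(range(len(loads)), key=lambda c: loads[c], reverse=True):
        core_load, core = heapq.heappop(heap)
        placement[chare] = core
        heapq.heappush(heap, (core_load + loads[chare], core))
    return placement

loads = [5.0, 3.0, 3.0, 2.0, 2.0, 1.0]   # measured load per chare (made-up values)
placement = greedy_assign(loads, n_cores=2)
per_core = [sum(loads[c] for c, p in placement.items() if p == core)
            for core in range(2)]
print(per_core)   # [8.0, 8.0]
```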


Several issues raised

Not so easy...

Several issues raised!

Scalability of TreeMatch

How to deal with process mapping (user, core numbering)

Intel Xeon 5550: 0,2,4,6,1,3,5,7

Intel Xeon 5550: 0,1,2,3,4,5,6,7 (!!)

AMD Interlagos: 0,1,2,3,4,5,6,7,...,30,31

Need to find a relevant compromise between process affinities and load balancing

What about load balancing time?
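The core numbering issue listed above can be handled with a translation table. The Python sketch below assumes one possible reading of the slide: the list 0,2,4,6,1,3,5,7 gives, for each leaf of the topology tree taken from left to right, the core id used by the operating system. Under that assumption a placement computed on leaf indices must be translated before binding. The names `to_os_core_ids`, `os_to_leaf` and the sample placement are hypothetical.

```python
def to_os_core_ids(sigma, leaf_to_os):
    """Translate leaf indices of the topology tree into OS core ids."""
    return [leaf_to_os[leaf] for leaf in sigma]

def os_to_leaf(leaf_to_os):
    """Inverse table: OS core id -> leaf index in the topology tree."""
    inverse = [None] * len(leaf_to_os)
    for leaf, os_id in enumerate(leaf_to_os):
        inverse[os_id] = leaf
    return inverse

interleaved = [0, 2, 4, 6, 1, 3, 5, 7]   # numbering quoted on the slide
sigma = [0, 1, 4, 5]                     # made-up placement, in leaf indices

print(to_os_core_ids(sigma, interleaved))  # [0, 2, 1, 3]
print(os_to_leaf(interleaved))             # [0, 4, 1, 5, 2, 6, 3, 7]
```

With the second numbering quoted on the slide (0,1,2,...,7) the table is the identity, so the same code works on both machines without special cases.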

The next slides present our load balancer, which relies on TreeMatch and libtopomap to perform parallel and distributed communication-aware load balancing.


Strategy for Charm++ - Network Placement

First step : minimize communication cost on network

libtopomap reorders processes from a communicator

How to use it to reorder groups of processes (or chares)? Example: groups of chares on nodes

Charm++ uses MPI: full access to the MPI API

New MPI communicator with MPI_Comm_split
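The semantics of MPI_Comm_split can be mimicked in a few lines of plain Python, without any MPI library: processes that pass the same color end up in the same new communicator, ordered by key and then by their old rank, and processes passing MPI_UNDEFINED get no communicator. The sketch assumes, as one plausible use that the slides do not spell out, that one process per node joins the new communicator. The names `comm_split`, `UNDEFINED` and `node_of` are hypothetical.

```python
from collections import defaultdict

UNDEFINED = None   # stands in for MPI_UNDEFINED

def comm_split(colors, keys):
    """Mimic MPI_Comm_split: one new group per color, ordered by (key, old rank)."""
    groups = defaultdict(list)
    for rank, color in enumerate(colors):
        if color is not UNDEFINED:
            groups[color].append(rank)
    return {color: sorted(ranks, key=lambda r: (keys[r], r))
            for color, ranks in groups.items()}

# 4 nodes x 2 processes; the first process of each node joins color 0.
node_of = [0, 0, 1, 1, 2, 2, 3, 3]
colors = [0 if rank % 2 == 0 else UNDEFINED for rank in range(8)]
leaders = comm_split(colors, keys=node_of)[0]
print(leaders)   # [0, 2, 4, 6]
```

A topology mapping library that reorders the ranks of a communicator can then be applied to this smaller communicator, so that each reordered rank stands for a whole node.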

Figure: nodes 0 to 3 on the network (3d torus, tree, …) and the new communicator


Strategy for Charm++ - Intra-node placement

TreeMatch load balancer

1st step: Remap groups of chares on nodes according to the communication on the network

libtopomap (example: part of 3d Torus)

2nd step: Reorder chares inside each node (distributed)

Apply TreeMatch on the NUMA topology and the chares communication pattern

Bind chares according to their load (leveling on less loaded chares)

Each node carries out its own placement in parallel
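The first step works on groups of chares rather than on individual chares, so it needs a group-to-group communication matrix. One plausible way to obtain it, sketched below in Python, is to sum the chare-to-chare volumes across groups. This is an assumption about the data preparation and is not described on the slides. The names `aggregate_by_group` and `group_of` are hypothetical.

```python
def aggregate_by_group(comm, group_of, n_groups):
    """Sum chare-to-chare volumes into a group-to-group matrix."""
    grouped = [[0] * n_groups for _ in range(n_groups)]
    for i, row in enumerate(comm):
        for j, volume in enumerate(row):
            gi, gj = group_of[i], group_of[j]
            if gi != gj:                 # traffic inside a group stays on the node
                grouped[gi][gj] += volume
    return grouped

comm = [[0, 4, 1, 0],
        [4, 0, 0, 2],
        [1, 0, 0, 7],
        [0, 2, 7, 0]]
group_of = [0, 0, 1, 1]          # chares 0,1 in one group; chares 2,3 in another
node_matrix = aggregate_by_group(comm, group_of, n_groups=2)
print(node_matrix)   # [[0, 3], [3, 0]]
```

The second step can then run independently on each node, using only the rows and columns of the chares in that node's group.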

Figure: groups of chares assigned to nodes (labels 3, 6, 8, 9, 12, 14, 15, 16), CPU load, network (3d torus, hierarchical, …)

Figure: part of a 3d torus allocated by the resource manager

Figure: groups of chares assigned to cores (labels 0, 2, 4, 6 for each node), CPU load, network (3d torus, hierarchical, …)

Figure: chares on cores 0, 2, 4, 6



Results

commBench

Benchmark designed to simulate irregular communications

Experiments on 16 nodes with 32 cores each (AMD Interlagos 6276) - Blue Waters Cluster

1 MB messages - 100 iterations - 2 distant receivers for each chare

Figure: commBench on 512 cores, 8192 elements, 1MB message size. Average time of one iteration in ms (axis from 0 to 150) for DummyLB, RefineCommLB and TreeMatchLB.


Results

commBench

1 MB messages - 100 iterations - 2 distant receivers for each chare

TreeMatch applied on a chares communication matrix

Figure: chares communication matrix for CommBench on 1 PlaFRIM node (axes: sender rank, receiver rank), shown in two panels with the label TreeMatch between them

Figure: σ(i) = 0, 8, 4, 5, 12, 1, 9, 6, 14, 2, 3, 13, 7, 10, 11, 15
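The permutation in the caption can be applied to a communication matrix with a short Python sketch. It assumes the same convention as the TreeMatch slide, where σ(i) is the core on which process i is bound, so the entry for the pair (i, j) moves to position (σ(i), σ(j)). Whether the second panel of the figure was produced exactly this way is an assumption. The function name `permute_matrix` and the synthetic matrix are hypothetical.

```python
def permute_matrix(comm, sigma):
    """Move the entry for processes (i, j) to position (sigma[i], sigma[j])."""
    n = len(comm)
    permuted = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            permuted[sigma[i]][sigma[j]] = comm[i][j]
    return permuted

# Permutation quoted in the figure caption.
sigma = [0, 8, 4, 5, 12, 1, 9, 6, 14, 2, 3, 13, 7, 10, 11, 15]
assert sorted(sigma) == list(range(16))   # every core is used exactly once

# Synthetic matrix with distinct entries, only to track where values move.
comm = [[i * 16 + j for j in range(16)] for i in range(16)]
permuted = permute_matrix(comm, sigma)
print(permuted[sigma[3]][sigma[7]] == comm[3][7])   # True
```

The reordering changes where the volumes sit in the matrix but not their total, so heavy exchanges can be moved close to the diagonal, that is, onto neighbouring cores.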


Results

kNeighbor

Benchmark application designed to simulate regular, intensive communication between processes

Experiments on 8 nodes with 8 cores each (Intel Xeon 5550) - PlaFRIM Cluster

Particularly compared to RefineCommLB

Takes into account load and communication

Minimizes migrations

Figure: kNeighbor on 64 cores, 1MB message size, with 128 elements (axis from 0 to 700) and with 256 elements (axis from 0 to 2000). Execution time (in seconds) for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased.


Results

kNeighbor

Experiments on 16 nodes with 8 cores each (Intel Xeon 5550) - PlaFRIM Cluster

1 MB messages - 100 iterations - 7-Neighbor

Figure: Execution time versus chares by core. x-axis: number of chares by core (1, 2, 4, 8, 16); y-axis: average time for each 7-kNeighbor iteration (in ms, 40 to 200); series: DummyLB 0-7, TreeMatchLB, DummyLB 0,2,4,6,1,3,5,7.


Results

What about the load balancing time?

Comparison between the sequential and the distributed versions of TreeMatchLB

The master node distributes the data to each node, which then computes its own chares placement. This data distribution can be done in parallel (around 20% improvement)

Figure: Time breakdown for each step of the load balancing process. y-axis: time in seconds (0 to 5); sequential and distributed versions compared for sizes 16384, 8192 and 4096; legend: Initialization, TM Sequential, TM Parallel.

Figure: timeline titled "4096 Chares - reverse - Par". x-axis: time (165.6 to 166.2); rows: Master and 0 to 7; phases: Init, Process results, Distribute, Calculate, Return.

Results

What about the load balancing time?

Linear trend as the number of chares is doubled

TreeMatchLB is slower than the Greedy strategies

RefineCommLB, which provides good results for communication-bound applications, is not scalable (fails from 8192 chares)

Figure: Load balancing time of the different strategies vs. number of chares for the kNeighbor application (running on 128 cores). x-axis: number of chares (128 to 8192); y-axis: execution time in ms (logarithmic scale, 0.1 to 10000); series: GreedyCommLB, GreedyLB, RefineCommLB, TreeMatchLB.


Future work and Conclusion

The end

Topology is not flat!

Processes affinities are not homogeneous

Taking this information into account when mapping chares improves performance

Algorithm adapted to large problems (Distributed)

Published at IEEE Cluster 2013

Future work

Find a better way to gather the topology (Hwloc?)

Improve network part (BGQ routing ?)

Perform more large scale experiments

Evaluate our solution on other applications (CFD ?)


The End

Thanks for your attention!

Any questions?
