Hierarchical Load Balancing for Large Scale Supercomputers

Gengbin Zheng, Parallel Programming Lab, UIUC
Charm++ Workshop 2010


Page 1: Hierarchical Load Balancing for Large Scale Supercomputers

Hierarchical Load Balancing for Large Scale Supercomputers

Gengbin Zheng

Parallel Programming Lab, UIUC


Page 2: Hierarchical Load Balancing for Large Scale Supercomputers

Outline
- Dynamic load balancing framework in Charm++
- Motivations
- Hierarchical load balancing strategy


Page 3: Hierarchical Load Balancing for Large Scale Supercomputers

Charm++ Dynamic Load-Balancing Framework
One of the most popular reasons to use Charm++/AMPI:
- Fully automatic
- Adaptive
- Application independent
- Modular and extensible


Page 4: Hierarchical Load Balancing for Large Scale Supercomputers

Principle of Persistence

Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time, in spite of dynamic behavior:
- Abrupt, large, but infrequent changes (e.g. AMR)
- Slow and small changes (e.g. particle migration)

This is the parallel analog of the principle of locality: a heuristic that holds for most CSE applications.


Page 5: Hierarchical Load Balancing for Large Scale Supercomputers

Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation (LB database): communication volume and computation time
- Measurement-based load balancers use the database periodically to make new decisions
- Many alternative strategies can use the database:
  - Centralized vs. distributed
  - Greedy vs. refinement
  - Taking communication into account
  - Taking dependencies into account (more complex)
  - Topology-aware
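GreedyLB and RefineLB are real Charm++ strategies; as an illustration only (in Python rather than the Charm++ C++ strategy API, with hypothetical names), a GreedyLB-style decision over the measured load database can be sketched as:

```python
import heapq

def greedy_lb(object_loads, nprocs):
    """GreedyLB-style sketch: repeatedly assign the heaviest remaining
    object to the currently least-loaded processor."""
    # Min-heap of (accumulated load, processor id).
    heap = [(0.0, p) for p in range(nprocs)]
    heapq.heapify(heap)
    mapping = {}
    # Walk objects from heaviest to lightest, as measured in the database.
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        total, proc = heapq.heappop(heap)
        mapping[obj] = proc
        heapq.heappush(heap, (total + load, proc))
    return mapping
```

A refinement strategy would instead start from the current mapping and move only a few objects off overloaded processors, which is cheaper but less aggressive.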


Page 6: Hierarchical Load Balancing for Large Scale Supercomputers

Load Balancer Strategies

Centralized:
- Object load data are sent to processor 0
- Integrated into a complete object graph
- Migration decisions are broadcast from processor 0
- Requires a global barrier

Distributed:
- Load balancing among neighboring processors
- Builds a partial object graph
- Migration decisions are sent to neighbors
- No global barrier
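The distributed, neighbor-only style can be illustrated with a generic diffusion sketch (a hypothetical example, not Charm++'s actual distributed balancer code): each processor compares its load with its neighbors and sheds a fraction of any surplus.

```python
def diffuse(loads, neighbors, alpha=0.5, steps=10):
    """Diffusion sketch: each processor repeatedly shifts a fraction
    `alpha` of its load surplus toward each lighter neighbor.
    `neighbors` maps processor id -> list of neighbor ids."""
    loads = list(loads)
    for _ in range(steps):
        new = list(loads)
        for p, nbrs in neighbors.items():
            for q in nbrs:
                if loads[p] > loads[q]:
                    delta = alpha * (loads[p] - loads[q]) / len(nbrs)
                    new[p] -= delta   # surplus leaves p
                    new[q] += delta   # and arrives at q
        loads = new
    return loads
```

Note how the scheme needs no global barrier and no global view, but converges only gradually, which is exactly the limitation discussed later.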

Page 7: Hierarchical Load Balancing for Large Scale Supercomputers

Limitations of Centralized Strategies

Consider an application with 1M objects on 64K processors. Centralized strategies are inherently not scalable:
- The central node becomes a memory/communication bottleneck
- Decision-making algorithms tend to be very slow

We demonstrate these limitations using a simulator we developed.


Page 8: Hierarchical Load Balancing for Large Scale Supercomputers

Memory Overhead (simulation results with lb_test)

[Chart: LB memory usage on the central node (MB) vs. number of objects (128K, 256K, 512K, 1M), for 32K and 64K cores]

The lb_test benchmark is a parameterized program that creates a specified number of communicating objects in a 2D mesh. Run on Lemieux with 64 processors.


Page 9: Hierarchical Load Balancing for Large Scale Supercomputers

Load Balancing Execution Time

[Chart: execution time (in seconds) of GreedyLB, GreedyCommLB, and RefineLB vs. number of objects (128K, 256K, 512K, 1M)]

Execution time of load balancing algorithms on a 64K-processor simulation.


Page 10: Hierarchical Load Balancing for Large Scale Supercomputers

Limitations of Distributed Strategies
- Each processor periodically exchanges load information and migrates objects among neighboring processors
- Performance improves only slowly
- Lack of global information
- Difficult to converge quickly to as good a solution as a centralized strategy

Result with NAMD on 256 processors.

Page 11: Hierarchical Load Balancing for Large Scale Supercomputers

A Hybrid Load Balancing Strategy
- Divide processors into independent groups, with groups organized in hierarchies (decentralized)
- Aggressive load balancing within sub-groups, combined with refinement-based cross-group load balancing
- Each group has a leader (its central node) which performs centralized load balancing
- Reuses existing centralized load balancing strategies
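A minimal two-level sketch of this hybrid idea, assuming a greedy step within each group and a simple refinement pass over group totals (hypothetical Python, not the Charm++ HybridLB implementation):

```python
import heapq

def greedy(objs, procs):
    """Per-group centralized step: heaviest object to least-loaded proc."""
    heap = [(0.0, p) for p in procs]
    heapq.heapify(heap)
    mapping = {}
    for obj, load in sorted(objs.items(), key=lambda kv: -kv[1]):
        total, proc = heapq.heappop(heap)
        mapping[obj] = proc
        heapq.heappush(heap, (total + load, proc))
    return mapping

def hybrid_lb(groups, nprocs_per_group):
    """groups: list of {obj: load} dicts, one per processor group.
    Refinement across groups: move light objects from the heaviest group
    to the lightest until their totals are within one object's load.
    Then balance greedily (and independently) inside each group."""
    groups = [dict(g) for g in groups]
    total = lambda g: sum(g.values())
    for _ in range(1000):
        hi = max(range(len(groups)), key=lambda i: total(groups[i]))
        lo = min(range(len(groups)), key=lambda i: total(groups[i]))
        if not groups[hi]:
            break
        obj = min(groups[hi], key=groups[hi].get)  # a light object
        if total(groups[hi]) - total(groups[lo]) <= groups[hi][obj]:
            break
        groups[lo][obj] = groups[hi].pop(obj)
    mapping, base = {}, 0
    for g in groups:
        mapping.update(greedy(g, range(base, base + nprocs_per_group)))
        base += nprocs_per_group
    return mapping
```

The cross-group pass touches only group totals, so its cost is independent of the object count within groups; that separation is what lets each level stay small.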


Page 12: Hierarchical Load Balancing for Large Scale Supercomputers

Hierarchical Tree (an example)

[Diagram: a 64K-processor hierarchical tree with three levels: a root, 64 group leaders (processors 0, 1024, ..., 63488, 64512), and 64 groups of 1024 processors each (0...1023, 1024...2047, ..., 64512...65535)]

- Apply different strategies at each level

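The leader layout of such a tree is easy to compute; a sketch assuming groups of consecutive processor ranks, each led by its first rank:

```python
def tree_leaders(nprocs, group_size):
    """Two-level hierarchical tree: every group of `group_size`
    consecutive processors is led by its first processor, and
    processor 0 doubles as the root."""
    leaders = list(range(0, nprocs, group_size))
    return leaders, 0  # (level-1 leaders, root)

# For the 64K example above: 64 leaders at 0, 1024, ..., 64512.
leaders, root = tree_leaders(65536, 1024)
```

Making leaders the first rank of each contiguous block is also what makes topology-aware tree construction (next slide) natural: contiguous ranks usually map to nearby nodes.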

Page 13: Hierarchical Load Balancing for Large Scale Supercomputers

Issues
- Load data reduction: semi-centralized load balancing scheme
- Reducing data movement: token-based local balancing
- Topology-aware tree construction


Page 14: Hierarchical Load Balancing for Large Scale Supercomputers

Token-based HybridLB Scheme

[Diagram: the 64K-processor hierarchical tree. Load data (object communication graph, OCG) flows up to group leaders; refinement-based load balancing is applied across groups and greedy-based load balancing within groups; tokens, rather than the objects themselves, are exchanged during balancing.]

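The token idea can be caricatured as deferring all physical migration until the final mapping is known, so that only lightweight decision records (object id plus load) travel during balancing; a hedged sketch, not the actual Charm++ scheme:

```python
def token_balance(mapping_old, mapping_new):
    """Tokens circulated during balancing stand in for objects, so
    intermediate decisions move no real data. Once the final mapping
    is fixed, emit one migration per object that actually moved:
    each object migrates at most once."""
    migrations = []
    for obj, src in mapping_old.items():
        dst = mapping_new.get(obj, src)
        if dst != src:
            migrations.append((obj, src, dst))
    return migrations
```

Compared with migrating after every intermediate decision, this bounds data movement by the number of objects whose final placement differs from their initial one.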

Page 15: Hierarchical Load Balancing for Large Scale Supercomputers

Performance Study with Synthetic Benchmark

[Chart: peak memory usage (MB) of CentralizedLB vs. Hierarchical on 4096, 8192, and 16384 cores]

lb_test benchmark on the Ranger cluster (1M objects).

Page 16: Hierarchical Load Balancing for Large Scale Supercomputers

Load Balancing Time (lb_test)

[Chart: LB time (s) of CentralizedLB vs. Hierarchical on 4096, 8192, and 16384 cores]

lb_test benchmark on the Ranger cluster.

Page 17: Hierarchical Load Balancing for Large Scale Supercomputers

Performance (lb_test)

[Chart: lb_test step time with no LB, CentralizedLB, and Hierarchical on 4096, 8192, and 16384 cores]

lb_test benchmark on the Ranger cluster.

Page 18: Hierarchical Load Balancing for Large Scale Supercomputers

NAMD Hierarchical LB
- NAMD implements its own specialized load balancing strategies, based on the Charm++ load balancing framework
- Extended NAMD's comprehensive and refinement-based strategies to work on subsets of processors


Page 19: Hierarchical Load Balancing for Large Scale Supercomputers

NAMD LB Time

[Chart: NAMD time/step with the Comprehensive and Refinement strategies on 256, 512, 1024, and 2048 cores]

Page 20: Hierarchical Load Balancing for Large Scale Supercomputers

NAMD LB Time (Comprehensive)

[Chart: load balancing time (s, log scale, 0.1 to 1000) of Comprehensive vs. Hierarchical on 512, 1024, 2048, 4096, and 8192 cores]

Page 21: Hierarchical Load Balancing for Large Scale Supercomputers

NAMD LB Time (Refinement)

[Chart: load balancing time (s, log scale, 0.1 to 1000) of Refinement vs. Hierarchical on 512, 1024, 2048, 4096, and 8192 cores]

Page 22: Hierarchical Load Balancing for Large Scale Supercomputers

NAMD Performance

[Chart: NAMD time/step with the Centralized vs. Hierarchical strategies on 512, 1024, 2048, 4096, and 8192 cores]

Page 23: Hierarchical Load Balancing for Large Scale Supercomputers

Conclusions
- Scalable load balancers are needed for large machines like BG/P
- The hierarchical scheme avoids the memory and communication bottleneck
- It achieves results similar to the more expensive centralized load balancers
- It takes processor topology into account
