Scalable and Topology-Aware Load Balancers in Charm++
Amit Sharma, Parallel Programming Lab, UIUC
DESCRIPTION
Dynamic load-balancing framework in Charm++: given a collection of migratable objects and a set of computers connected in a certain topology, find a mapping of objects to processors such that each processor has almost the same amount of computation and communication between processors is minimized, i.e., a dynamic mapping of chares to processors.

TRANSCRIPT
Scalable and Topology-Aware Load Balancers in Charm++
Amit Sharma, Parallel Programming Lab, UIUC
Outline
Dynamic Load-Balancing Framework in Charm++
Load Balancing on Large Machines
Scalable Load Balancers
Topology-Aware Load Balancers
Dynamic Load-Balancing Framework in Charm++
The load balancing task in Charm++: given a collection of migratable objects and a set of computers connected in a certain topology, find a mapping of objects to processors such that:
each processor has almost the same amount of computation;
communication between processors is minimized.
This is a dynamic mapping of chares to processors.
Load-Balancing Approaches
Two major approaches:
No predictability of load patterns: fully dynamic; early work on state-space search, branch-and-bound, etc.; seed load balancers.
With certain predictability (e.g., CSE, molecular dynamics simulation): measurement-based load balancing strategy.
Principle of Persistence
Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time, in spite of dynamic behavior:
abrupt and large, but infrequent, changes (e.g., AMR);
slow and small changes (e.g., particle migration).
This is the parallel analog of the principle of locality: a heuristic that holds for most CSE applications.
Measurement-Based Load Balancing
Based on the principle of persistence. Runtime instrumentation (the LB database) records communication volume and computation time. Measurement-based load balancers use the database periodically to make new decisions. Many alternative strategies can use the database:
centralized vs. distributed;
greedy improvements vs. complete reassignments;
taking communication into account;
taking dependencies into account (more complex);
topology-aware.
Load Balancer Strategies
Centralized: object load data are sent to processor 0 and integrated into a complete object graph; the migration decision is broadcast from processor 0; requires a global barrier.
Distributed: load balancing among neighboring processors; builds only a partial object graph; each migration decision is sent just to a processor's neighbors; no global barrier.
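As an illustration of the centralized approach, here is a minimal Python sketch (not Charm++'s actual implementation) of a greedy strategy in the spirit of GreedyLB: the central node sorts objects by measured load and assigns each to the currently least-loaded processor. The object loads below are made up for illustration.

```python
import heapq

def greedy_assign(object_loads, num_procs):
    """Centralized greedy load balancing: assign each object, heaviest
    first, to the currently least-loaded processor (a GreedyLB-style idea)."""
    heap = [(0.0, p) for p in range(num_procs)]  # (current load, processor)
    heapq.heapify(heap)
    mapping = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        proc_load, proc = heapq.heappop(heap)
        mapping[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    return mapping

# Hypothetical measured loads (seconds) from the LB database
loads = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 2.0, "E": 1.0}
mapping = greedy_assign(loads, 2)
print(mapping)
```

With these loads, both processors end up with 6.0 seconds of work. The min-heap makes each placement decision O(log P), which is one reason even this simple strategy becomes slow when the central node must place a million objects.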
Load Balancing on Very Large Machines – New Challenges
Existing load balancing strategies don't scale on extremely large machines. Consider an application with 1M objects on 64K processors. The limiting factors and issues: the decision-making algorithm, the difficulty of achieving well-informed load balancing decisions, and resource limitations.
Limitations of Centralized Strategies
Effective on a small number of processors, where it is easy to achieve good load balance. Limitations (inherently not scalable): the central node becomes a memory and communication bottleneck, and the decision-making algorithms tend to be very slow. We demonstrate these limitations using the simulator we developed.
Memory Overhead (simulation results with lb_test)
[Chart: LB memory usage on the central node (MB) vs. number of objects (128K, 256K, 512K, 1M), for 32K and 64K processors.]
The lb_test benchmark is a parameterized program that creates a specified number of communicating objects in a 2D mesh. Run on Lemieux, 64 processors.
Load Balancing Execution Time
[Chart: execution time (in seconds) of GreedyLB, GreedyCommLB, and RefineLB vs. number of objects (128K, 256K, 512K, 1M).]
Execution time of load balancing algorithms on a 64K-processor simulation.
Why Hierarchical LB?
Centralized load balancer: a communication bottleneck on processor 0; memory constraints.
Fully distributed load balancer: neighborhood balancing, but without global load information.
Hierarchical distributed load balancer: divide the processors into groups and apply different strategies at each level; scalable to a large number of processors.
A Hybrid Load Balancing Strategy
Divide processors into independent sets of groups, with the groups organized in hierarchies (decentralized). Each group has a leader (the central node) which performs centralized load balancing. A particular hybrid strategy that works well.
Gengbin Zheng, PhD Thesis, 2005
Hierarchical Tree (an example)
[Figure: a 64K-processor hierarchical tree. Level 0: processors 0 … 65535, in groups of 1024 (0 … 1023, 1024 … 2047, …, 63488 … 64511, 64512 … 65535). Level 1: the 64 group leaders (processors 0, 1024, …, 63488, 64512). Level 2: the root of the tree.]
Apply different strategies at each level.
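With the group sizes from this example (1024 processors per level-1 group, 64 groups under the root), a processor can locate its leaders with simple arithmetic. A hypothetical sketch (the function names are illustrative, not Charm++ API):

```python
def level1_leader(pe, group_size=1024):
    """The level-1 leader of a PE is the lowest-ranked PE in its group."""
    return (pe // group_size) * group_size

# With 64K processors and 1024-PE groups there are 64 group leaders,
# and a single root (PE 0 here) coordinates them at level 2.
leaders = [level1_leader(g * 1024) for g in range(64)]
print(level1_leader(5000))  # 4096
```

This is why the scheme scales: each leader only ever sees the load data of its own 1024-PE group, never the whole machine.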
Our HybridLB Scheme
[Figure: the same 64K-processor tree. Load data (the object communication graph, OCG) flows up from processors to their group leaders; refinement-based load balancing is applied at the upper level and greedy-based load balancing within each group; migration decisions travel down as tokens, followed by the objects themselves.]
Simulation Study – Memory Usage
[Chart: memory usage (MB) of CentralLB vs. HybridLB, for 256K, 512K, and 1M objects; lb_test simulation of 64K processors.]
Simulation of the lb_test benchmark with the performance simulator.
Total Load Balancing Time
[Chart: time (s) of GreedyCommLB vs. HybridLB(GreedyCommLB), for 256K, 512K, and 1M objects; lb_test simulation of 64K processors.]
Load Balancing Quality
[Chart: maximum predicted load (seconds) of GreedyCommLB vs. HybridLB, for 256K, 512K, and 1M objects; lb_test simulation of 64K processors.]
Topology-Aware Mapping of Tasks
Problem: map tasks to processors connected in a topology, such that:
compute load on the processors is balanced;
communicating chares (objects) are placed on nearby processors.
Mapping Model
Task graph: G_t = (V_t, E_t), a weighted graph with undirected edges. Nodes are chares, with w(v_a) the computation of chare v_a; edges are communication, with c_ab the bytes exchanged between v_a and v_b.
Topology graph: G_p = (V_p, E_p). Nodes are processors; edges are direct network links. Examples: 3D torus, 2D mesh, hypercube.
Model (Cont.)
Task mapping: assigns tasks to processors, P : V_t → V_p.
Hop-Bytes: the communication cost. The cost imposed on the network is higher if more links are used, so inter-processor communication is weighted by distance on the network:
HB = Σ_{e_ab ∈ E_t} c_ab · d(P(v_a), P(v_b))
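The hop-bytes sum can be computed directly from a task graph and a placement. A small Python sketch, assuming a 2D mesh where d is Manhattan distance (the task graph and placement below are made up for illustration):

```python
def mesh_distance(p, q):
    """Manhattan distance between 2D-mesh coordinates p and q."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def hop_bytes(edges, placement, dist=mesh_distance):
    """HB = sum over task-graph edges (a, b, c_ab) of c_ab * d(P(a), P(b))."""
    return sum(c_ab * dist(placement[a], placement[b]) for a, b, c_ab in edges)

# Hypothetical task graph (task_a, task_b, bytes) and a placement on the mesh
edges = [("t0", "t1", 100), ("t1", "t2", 50)]
placement = {"t0": (0, 0), "t1": (0, 1), "t2": (2, 1)}
print(hop_bytes(edges, placement))  # 100*1 + 50*2 = 200
```

On a torus the distance function would wrap around in each dimension; the rest of the computation is unchanged.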
Metric
Minimize Hop-Bytes, or equivalently Hops-per-Byte: the average number of hops traveled by a byte under a task mapping.
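Hops-per-byte is simply hop-bytes normalized by the total traffic; a short sketch (a hypothetical 1D distance is used for brevity):

```python
def hops_per_byte(edges, placement, dist):
    """Average hops traveled by a byte: HB divided by total bytes sent."""
    total_bytes = sum(c for _, _, c in edges)
    hb = sum(c * dist(placement[a], placement[b]) for a, b, c in edges)
    return hb / total_bytes

# Hypothetical task graph and a 1D processor line, for illustration only
edges = [("t0", "t1", 100), ("t1", "t2", 50)]
placement = {"t0": 0, "t1": 1, "t2": 3}
dist = lambda p, q: abs(p - q)
print(hops_per_byte(edges, placement, dist))  # (100*1 + 50*2) / 150
```

Because the denominator is fixed for a given application, minimizing hops-per-byte and minimizing hop-bytes pick out the same mappings.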
TopoLB: Topology-Aware LB – Overview
First coalesce the task graph to n nodes (n = number of processors). MetisLB is used because it reduces inter-group communication; GreedyLB, GreedyCommLB, etc. can also be used. Then repeat n times: pick a task t and a processor p, and place t on p (P(t) ← p).
Tarun Agarwal, MS Thesis, 2005
Picking t, p
t is the task for which placement in this iteration is critical; p is the processor where t costs least. The cost of placing t on p is approximated as:
cost(t; p) = Σ_{m ∈ assigned tasks} c_tm · d(P(m), p) + Σ_{m ∈ unassigned tasks} c_tm · d(p̄, p)
where p̄ stands for the not-yet-known future placement of an unassigned task. Note that:
HB(t) = Σ_{m ∈ V_t} c_tm · d(P(m), P(t))
Hop-Bytes = Σ_{t ∈ V_t} HB(t)
Picking t, p (Cont.)
Criticality of placing t in this iteration: by how much will the cost of placing t increase in the future? Future cost: t would be placed on some random free processor in a future iteration, so:
future_cost(t) = (1 / |free procs|) · Σ_{p_i ∈ free procs} cost(t; p_i)
criticality(t) = future_cost(t) − cost(t; p_best)
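One iteration of this selection rule can be sketched as follows (Python; the cost table below is hypothetical): for each unplaced task, average its cost over the free processors to get future_cost, subtract the cost at its best processor, and pick the task with the highest criticality.

```python
def pick_task_and_proc(cost, free_procs, unplaced):
    """cost[t][p]: approximate cost of placing task t on processor p.
    Pick (t, p) maximizing criticality(t) = future_cost(t) - cost(t, p_best)."""
    best = None
    for t in unplaced:
        p_best = min(free_procs, key=lambda p: cost[t][p])
        future_cost = sum(cost[t][p] for p in free_procs) / len(free_procs)
        crit = future_cost - cost[t][p_best]
        if best is None or crit > best[0]:
            best = (crit, t, p_best)
    return best[1], best[2]

# Hypothetical costs: t1 is insensitive to placement, t2 is not
cost = {"t1": {0: 5.0, 1: 5.0}, "t2": {0: 1.0, 1: 9.0}}
print(pick_task_and_proc(cost, [0, 1], ["t1", "t2"]))  # ('t2', 0)
```

The intuition: t2 would pay dearly if deferred to a random processor, so it is placed first, while t1 can safely wait.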
Putting it together
TopoCentLB: A Faster Topology-Aware LB
Coalesce the task graph to n nodes (n = number of processors). Picking task t and processor p: t is the task which has the maximum total communication with already-assigned tasks; p is the processor where t costs least:
cost(t; p) = Σ_{m ∈ assigned tasks} c_tm · d(P(m), p)
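A sketch of this simpler rule (Python, with hypothetical inputs): among unplaced tasks, pick the one with the most bytes to already-placed tasks, then place it on the free processor minimizing the distance-weighted cost above.

```python
def topocent_step(comm, placed, free_procs, unplaced, dist):
    """comm[a][b]: bytes between tasks a and b; placed: task -> processor.
    Pick the task with most bytes to placed tasks, then the cheapest PE."""
    t = max(unplaced, key=lambda u: sum(comm[u].get(m, 0) for m in placed))
    p = min(free_procs,
            key=lambda q: sum(comm[t].get(m, 0) * dist(placed[m], q)
                              for m in placed))
    return t, p

# Hypothetical task graph and a 1D processor line, for illustration
comm = {"a": {"b": 10}, "b": {"a": 10, "c": 80}, "c": {"b": 80}}
placed = {"a": 0}
dist = lambda p, q: abs(p - q)
t, p = topocent_step(comm, placed, [1, 2], ["b", "c"], dist)
print(t, p)  # b 1
```

Here b is picked first (it talks to the already-placed a; c does not) and lands next to a, which is exactly the "past mapping only" behavior the next slide contrasts with TopoLB.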
TopoCentLB (Cont.)
Difference from TopoLB: no notion of criticality as in TopoLB; considers only the past mapping and doesn't look into the future.
Running complexity: TopoLB depends on the criticality function, ranging from O(p·|E_t|) to O(p^3); TopoCentLB is O(p·|E_t|), with a smaller constant than TopoLB.
Results
Compare TopoLB, TopoCentLB, and random placement.
Charm++ LB simulation mode: 2D-Jacobi-like benchmark and LeanMD; reduction in hop-bytes.
BlueGene/L: 2D-Jacobi-like benchmark; reduction in running time.
Simulation Results: 2D-mesh pattern on a 3D-torus topology (same size)
Simulation Results: LeanMD on a 3D-torus
Experimental Results (BlueGene): 2D-mesh pattern on a 3D-torus (message size: 100KB)
Experimental Results (BlueGene): 2D-mesh pattern on a 3D-mesh (message size: 100KB)
Conclusions
Scalable LBs will be needed in the future due to large machines like BG/L.
Hybrid load balancers: a distributed approach that keeps communication localized within a neighborhood.
Efficient topology-aware task-mapping strategies which reduce hop-bytes also lead to lower network latencies and better tolerance of contention and bandwidth constraints.