Scalable and Topology-Aware Load Balancers in Charm++
Amit Sharma, Parallel Programming Lab, UIUC
DESCRIPTION
Dynamic load-balancing framework in Charm++: given a collection of migratable objects and a set of computers connected in a certain topology, find a mapping of objects to processors such that each processor has almost the same amount of computation and communication between processors is minimized, i.e., a dynamic mapping of chares to processors.

TRANSCRIPT
Scalable and Topology-Aware Load Balancers in Charm++
Amit Sharma, Parallel Programming Lab, UIUC
Outline
Dynamic Load-Balancing Framework in Charm++
Load Balancing on Large Machines
Scalable Load Balancers
Topology-Aware Load Balancers
Dynamic Load-Balancing Framework in Charm++
The load balancing task in Charm++: given a collection of migratable objects and a set of computers connected in a certain topology, find a mapping of objects to processors such that:
each processor has almost the same amount of computation;
communication between processors is minimized.
This is a dynamic mapping of chares to processors.
Load-Balancing Approaches
Two major approaches:
No predictability of load patterns: fully dynamic; early work on state-space search, branch-and-bound, etc.; seed load balancers.
With certain predictability (e.g., CSE, molecular dynamics simulation): measurement-based load balancing strategy.
Principle of Persistence
Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time, in spite of dynamic behavior:
abrupt and large, but infrequent, changes (e.g., AMR);
slow and small changes (e.g., particle migration).
This is the parallel analog of the principle of locality: a heuristic that holds for most CSE applications.
Measurement-Based Load Balancing
Based on the principle of persistence. Runtime instrumentation (the LB database) records communication volume and computation time. Measurement-based load balancers use the database periodically to make new decisions. Many alternative strategies can use the database:
centralized vs. distributed;
greedy improvements vs. complete reassignments;
taking communication into account;
taking dependencies into account (more complex);
topology-aware.
Load Balancer Strategies
Centralized: object load data are sent to processor 0 and integrated into a complete object graph; the migration decision is broadcast from processor 0; requires a global barrier.
Distributed: load balancing among neighboring processors; builds only a partial object graph; each migration decision is sent just to a processor's neighbors; no global barrier.
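As an illustration of the centralized approach, here is a minimal Python sketch (not Charm++'s actual implementation) of a greedy strategy in the spirit of GreedyLB: the central node sorts objects by measured load and assigns each to the currently least-loaded processor. The object loads below are made up for illustration.

```python
import heapq

def greedy_assign(object_loads, num_procs):
    """Centralized greedy load balancing: assign each object, heaviest
    first, to the currently least-loaded processor (a GreedyLB-style idea)."""
    heap = [(0.0, p) for p in range(num_procs)]  # (current load, processor)
    heapq.heapify(heap)
    mapping = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        proc_load, proc = heapq.heappop(heap)
        mapping[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    return mapping

# Hypothetical measured loads (seconds) from the LB database
loads = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 2.0, "E": 1.0}
mapping = greedy_assign(loads, 2)
print(mapping)
```

With these loads, both processors end up with 6.0 seconds of work. The min-heap makes each placement decision O(log P), which is one reason even this simple strategy becomes slow when the central node must place a million objects.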
Load Balancing on Very Large Machines – New Challenges
Existing load balancing strategies don't scale on extremely large machines. Consider an application with 1M objects on 64K processors. The limiting factors and issues: the decision-making algorithm, the difficulty of achieving well-informed load balancing decisions, and resource limitations.
Limitations of Centralized Strategies
Effective on a small number of processors, where it is easy to achieve good load balance. Limitations (inherently not scalable): the central node becomes a memory and communication bottleneck, and the decision-making algorithms tend to be very slow. We demonstrate these limitations using the simulator we developed.
Memory Overhead (simulation results with lb_test)
[Chart: LB memory usage on the central node (MB) vs. number of objects (128K, 256K, 512K, 1M), for 32K and 64K processors.]
The lb_test benchmark is a parameterized program that creates a specified number of communicating objects in a 2D mesh. Run on Lemieux, 64 processors.
Load Balancing Execution Time
[Chart: execution time (in seconds) of GreedyLB, GreedyCommLB, and RefineLB vs. number of objects (128K, 256K, 512K, 1M).]
Execution time of load balancing algorithms on a 64K-processor simulation.
Why Hierarchical LB?
Centralized load balancer: a communication bottleneck on processor 0; memory constraints.
Fully distributed load balancer: neighborhood balancing, but without global load information.
Hierarchical distributed load balancer: divide the processors into groups and apply different strategies at each level; scalable to a large number of processors.
A Hybrid Load Balancing Strategy
Divide processors into independent sets of groups, with the groups organized in hierarchies (decentralized). Each group has a leader (the central node) which performs centralized load balancing. A particular hybrid strategy that works well.
Gengbin Zheng, PhD Thesis, 2005
Hierarchical Tree (an example)
[Figure: a 64K-processor hierarchical tree. Level 0: processors 0 … 65535, in groups of 1024 (0 … 1023, 1024 … 2047, …, 63488 … 64511, 64512 … 65535). Level 1: the 64 group leaders (processors 0, 1024, …, 63488, 64512). Level 2: the root of the tree.]
Apply different strategies at each level.
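With the group sizes from this example (1024 processors per level-1 group, 64 groups under the root), a processor can locate its leaders with simple arithmetic. A hypothetical sketch (the function names are illustrative, not Charm++ API):

```python
def level1_leader(pe, group_size=1024):
    """The level-1 leader of a PE is the lowest-ranked PE in its group."""
    return (pe // group_size) * group_size

# With 64K processors and 1024-PE groups there are 64 group leaders,
# and a single root (PE 0 here) coordinates them at level 2.
leaders = [level1_leader(g * 1024) for g in range(64)]
print(level1_leader(5000))  # 4096
```

This is why the scheme scales: each leader only ever sees the load data of its own 1024-PE group, never the whole machine.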
Our HybridLB Scheme
[Figure: the same 64K-processor tree. Load data (the object communication graph, OCG) flows up from processors to their group leaders; refinement-based load balancing is applied at the upper level and greedy-based load balancing within each group; migration decisions travel down as tokens, followed by the objects themselves.]
Simulation Study – Memory Usage
[Chart: memory usage (MB) of CentralLB vs. HybridLB, for 256K, 512K, and 1M objects; lb_test simulation of 64K processors.]
Simulation of the lb_test benchmark with the performance simulator.
Total Load Balancing Time
[Chart: time (s) of GreedyCommLB vs. HybridLB(GreedyCommLB), for 256K, 512K, and 1M objects; lb_test simulation of 64K processors.]
Load Balancing Quality
[Chart: maximum predicted load (seconds) of GreedyCommLB vs. HybridLB, for 256K, 512K, and 1M objects; lb_test simulation of 64K processors.]
Topology-Aware Mapping of Tasks
Problem: map tasks to processors connected in a topology, such that:
compute load on the processors is balanced;
communicating chares (objects) are placed on nearby processors.
Mapping Model
Task graph: G_t = (V_t, E_t), a weighted graph with undirected edges. Nodes are chares, with w(v_a) the computation of chare v_a; edges are communication, with c_ab the bytes exchanged between v_a and v_b.
Topology graph: G_p = (V_p, E_p). Nodes are processors; edges are direct network links. Examples: 3D torus, 2D mesh, hypercube.
Model (Cont.)
Task mapping: assigns tasks to processors, P : V_t → V_p.
Hop-Bytes: the communication cost. The cost imposed on the network is higher if more links are used, so inter-processor communication is weighted by distance on the network:
HB = Σ_{e_ab ∈ E_t} c_ab · d(P(v_a), P(v_b))
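The hop-bytes sum can be computed directly from a task graph and a placement. A small Python sketch, assuming a 2D mesh where d is Manhattan distance (the task graph and placement below are made up for illustration):

```python
def mesh_distance(p, q):
    """Manhattan distance between 2D-mesh coordinates p and q."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def hop_bytes(edges, placement, dist=mesh_distance):
    """HB = sum over task-graph edges (a, b, c_ab) of c_ab * d(P(a), P(b))."""
    return sum(c_ab * dist(placement[a], placement[b]) for a, b, c_ab in edges)

# Hypothetical task graph (task_a, task_b, bytes) and a placement on the mesh
edges = [("t0", "t1", 100), ("t1", "t2", 50)]
placement = {"t0": (0, 0), "t1": (0, 1), "t2": (2, 1)}
print(hop_bytes(edges, placement))  # 100*1 + 50*2 = 200
```

On a torus the distance function would wrap around in each dimension; the rest of the computation is unchanged.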
Metric
Minimize Hop-Bytes, or equivalently Hops-per-Byte: the average number of hops traveled by a byte under a task mapping.
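Hops-per-byte is simply hop-bytes normalized by the total traffic; a short sketch (a hypothetical 1D distance is used for brevity):

```python
def hops_per_byte(edges, placement, dist):
    """Average hops traveled by a byte: HB divided by total bytes sent."""
    total_bytes = sum(c for _, _, c in edges)
    hb = sum(c * dist(placement[a], placement[b]) for a, b, c in edges)
    return hb / total_bytes

# Hypothetical task graph and a 1D processor line, for illustration only
edges = [("t0", "t1", 100), ("t1", "t2", 50)]
placement = {"t0": 0, "t1": 1, "t2": 3}
dist = lambda p, q: abs(p - q)
print(hops_per_byte(edges, placement, dist))  # (100*1 + 50*2) / 150
```

Because the denominator is fixed for a given application, minimizing hops-per-byte and minimizing hop-bytes pick out the same mappings.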
TopoLB: Topology-Aware LB – Overview
First coalesce the task graph to n nodes (n = number of processors). MetisLB is used because it reduces inter-group communication; GreedyLB, GreedyCommLB, etc. can also be used. Then repeat n times: pick a task t and a processor p, and place t on p (P(t) ← p).
Tarun Agarwal, MS Thesis, 2005
Picking t, p
t is the task for which placement in this iteration is critical; p is the processor where t costs least. The cost of placing t on p is approximated as:
cost(t; p) = Σ_{m ∈ assigned tasks} c_tm · d(P(m), p) + Σ_{m ∈ unassigned tasks} c_tm · d(p̄, p)
where p̄ stands for the not-yet-known future placement of an unassigned task. Note that:
HB(t) = Σ_{m ∈ V_t} c_tm · d(P(m), P(t))
Hop-Bytes = Σ_{t ∈ V_t} HB(t)
Picking t, p (Cont.)
Criticality of placing t in this iteration: by how much will the cost of placing t increase in the future? Future cost: t would be placed on some random free processor in a future iteration, so:
future_cost(t) = (1 / |free procs|) · Σ_{p_i ∈ free procs} cost(t; p_i)
criticality(t) = future_cost(t) − cost(t; p_best)
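One iteration of this selection rule can be sketched as follows (Python; the cost table below is hypothetical): for each unplaced task, average its cost over the free processors to get future_cost, subtract the cost at its best processor, and pick the task with the highest criticality.

```python
def pick_task_and_proc(cost, free_procs, unplaced):
    """cost[t][p]: approximate cost of placing task t on processor p.
    Pick (t, p) maximizing criticality(t) = future_cost(t) - cost(t, p_best)."""
    best = None
    for t in unplaced:
        p_best = min(free_procs, key=lambda p: cost[t][p])
        future_cost = sum(cost[t][p] for p in free_procs) / len(free_procs)
        crit = future_cost - cost[t][p_best]
        if best is None or crit > best[0]:
            best = (crit, t, p_best)
    return best[1], best[2]

# Hypothetical costs: t1 is insensitive to placement, t2 is not
cost = {"t1": {0: 5.0, 1: 5.0}, "t2": {0: 1.0, 1: 9.0}}
print(pick_task_and_proc(cost, [0, 1], ["t1", "t2"]))  # ('t2', 0)
```

The intuition: t2 would pay dearly if deferred to a random processor, so it is placed first, while t1 can safely wait.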
Putting it together
TopoCentLB: A Faster Topology-Aware LB
Coalesce the task graph to n nodes (n = number of processors). Picking task t and processor p: t is the task which has the maximum total communication with already-assigned tasks; p is the processor where t costs least:
cost(t; p) = Σ_{m ∈ assigned tasks} c_tm · d(P(m), p)
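A sketch of this simpler rule (Python, with hypothetical inputs): among unplaced tasks, pick the one with the most bytes to already-placed tasks, then place it on the free processor minimizing the distance-weighted cost above.

```python
def topocent_step(comm, placed, free_procs, unplaced, dist):
    """comm[a][b]: bytes between tasks a and b; placed: task -> processor.
    Pick the task with most bytes to placed tasks, then the cheapest PE."""
    t = max(unplaced, key=lambda u: sum(comm[u].get(m, 0) for m in placed))
    p = min(free_procs,
            key=lambda q: sum(comm[t].get(m, 0) * dist(placed[m], q)
                              for m in placed))
    return t, p

# Hypothetical task graph and a 1D processor line, for illustration
comm = {"a": {"b": 10}, "b": {"a": 10, "c": 80}, "c": {"b": 80}}
placed = {"a": 0}
dist = lambda p, q: abs(p - q)
t, p = topocent_step(comm, placed, [1, 2], ["b", "c"], dist)
print(t, p)  # b 1
```

Here b is picked first (it talks to the already-placed a; c does not) and lands next to a, which is exactly the "past mapping only" behavior the next slide contrasts with TopoLB.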
TopoCentLB (Cont.)
Difference from TopoLB: no notion of criticality as in TopoLB; considers only the past mapping and doesn't look into the future.
Running complexity: TopoLB depends on the criticality function, ranging from O(p·|E_t|) to O(p^3); TopoCentLB is O(p·|E_t|), with a smaller constant than TopoLB.
Results
Compare TopoLB, TopoCentLB, and random placement.
Charm++ LB simulation mode: 2D-Jacobi-like benchmark and LeanMD; reduction in hop-bytes.
BlueGene/L: 2D-Jacobi-like benchmark; reduction in running time.
Simulation Results: 2D-mesh pattern on a 3D-torus topology (same size)
Simulation Results: LeanMD on a 3D-torus
Experimental Results (BlueGene): 2D-mesh pattern on a 3D-torus (message size: 100KB)
Experimental Results (BlueGene): 2D-mesh pattern on a 3D-mesh (message size: 100KB)
Conclusions
Scalable LBs will be needed in the future due to large machines like BG/L.
Hybrid load balancers: a distributed approach that keeps communication localized within a neighborhood.
Efficient topology-aware task-mapping strategies which reduce hop-bytes also lead to lower network latencies and better tolerance of contention and bandwidth constraints.