A Multi-Agent System Approach to Load Balancing and Resource Allocation for Distributed Computing

Soumya Banerjee & Joshua Hecker

Post on 21-Feb-2017

Motivation

• Age of distributed computing

• Trend toward moving computation onto inexpensive but geographically distributed computers

• Examples: SETI@home, LHC@home

• Need for efficient allocation algorithms

Decentralized Computing

• Can alleviate computing load on centralized monitors

• Robust to single-point failures

• Can achieve application-level resource management (nodes can manage resources better than a global monitor)

• Can scale more gracefully: as the system grows, a centralized monitor has to communicate with more and more nodes

• Can better respond to fluctuations in process requirements

• Example scenario: the system has to "forget" past process requirements and completely rebuild new clusters after servicing one process, i.e., there is no locality

Multi-Agent Systems

• An agent is a computing node; agents join together to form a cluster

• Multi-agent systems have emergent properties

• Have been used to model biological phenomena and real-life problems, e.g., Keepaway soccer and ant foraging (pictured on the slide)

Problem Statement and Assumptions

• A huge number of distributed nodes or agents

• Advantages to computing with geographically proximal computers due to network latency, bandwidth limitations, etc.

• There is a global data structure (the task queue) which holds a large number of tasks/processes

• A new process that comes into the system declares a priori the number of threads that it can be parallelized into and its resource requirements ($CPU_{req}$)

• A cluster is a network of computers which together can completely service the resource requirements of a single task

• Over time, clusters are created, dissolved and created again dynamically in order to serve the resource requirements of the tasks in the queue
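The assumptions above can be made concrete with a small data model. This is only a sketch: the class and field names (`Task`, `Cluster`, `cpu_req`, `node_ids`) are hypothetical, and it assumes that every node contributes exactly one CPU unit, which the slides do not state explicitly.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: int
    threads: int   # number of threads declared a priori
    cpu_req: int   # CPU units required (CPU_req)


@dataclass
class Cluster:
    # Assumption: each member node contributes one CPU unit.
    node_ids: list = field(default_factory=list)

    @property
    def cpu_capacity(self) -> int:
        return len(self.node_ids)

    def can_service(self, task: Task) -> bool:
        """True when the cluster can completely service the task."""
        return self.cpu_capacity >= task.cpu_req


# Global task queue shared by all agents (illustrative contents).
queue = deque([Task(0, threads=4, cpu_req=4), Task(1, threads=2, cpu_req=2)])
cluster = Cluster(node_ids=[10, 11, 12])
serviceable = [t.task_id for t in queue if cluster.can_service(t)]
print(serviceable)
```

A three-node cluster can take the two-CPU task but not the four-CPU one, which is the situation that forces clusters to be dissolved and re-formed as the queue changes.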

dRAP Algorithm

• dRAP: Distributed Resource Allocation Procedure

• Mode 1: an agent/node that is currently not part of a cluster and has no task assigned to it

1. agent looks at queue Q, examines unallocated tasks and takes on the task which minimizes $|1 - CPU_{req}|$

• Mode 2: an agent/node that is currently not part of a cluster and has a task assigned to it

1. keep on executing task

2. if the task requirements are not completely satisfied, i.e., $CPU_{req} > 1$, keep on querying your neighbors and try to form a cluster such that $CPU_{req} = CPU_{cluster}$

3. when task completes, go to Mode 1
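The Mode 1 selection rule and the Mode 2 cluster-growing step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: tasks are represented only by their CPU requirement, every node contributes one CPU unit, and the neighbor list and function names are hypothetical.

```python
def pick_task_for_single_node(unallocated):
    """Mode 1: choose the unallocated task minimizing |1 - cpu_req|."""
    if not unallocated:
        return None
    return min(unallocated, key=lambda cpu_req: abs(1 - cpu_req))


def grow_cluster(cpu_req, idle_neighbors):
    """Mode 2: recruit idle neighbors until cluster capacity equals cpu_req.

    Returns the cluster members and whether the requirement was met.
    """
    cluster = ["self"]  # the agent that owns the task
    for neighbor in idle_neighbors:
        if len(cluster) == cpu_req:
            break
        cluster.append(neighbor)
    return cluster, len(cluster) == cpu_req


chosen = pick_task_for_single_node([5, 2, 3])   # CPU requirements in the queue
cluster, satisfied = grow_cluster(chosen, ["n1", "n2", "n3"])
print(chosen, cluster, satisfied)
```

If too few idle neighbors are available, `grow_cluster` reports `False`, which corresponds to the agent continuing to query its neighbors on later timesteps.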

dRAP Algorithm (continued)

• Mode 3: an agent/node that is currently part of a cluster and has no task assigned to it

1. agent looks at queue Q, examines unallocated tasks and takes on the task which minimizes $|CPU_{req} - CPU_{cluster}|$

• Mode 4: an agent/node that is currently part of a cluster and has a task assigned to it

1. keep on executing task

2. when task completes, break up cluster and go to Mode 1
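The Mode 3 rule generalizes the single-node rule: a cluster of capacity CPU_cluster takes the task whose requirement is closest to that capacity, and Mode 1 is the special case CPU_cluster = 1. A minimal sketch, assuming tasks are represented by their CPU requirement alone and that ties go to the task earliest in the queue (the slides do not specify tie-breaking):

```python
def pick_task_for_cluster(unallocated, cpu_cluster):
    """Mode 3: choose the task minimizing |cpu_req - cpu_cluster|."""
    if not unallocated:
        return None
    # min() returns the first minimal element, so ties favor queue order.
    return min(unallocated, key=lambda cpu_req: abs(cpu_req - cpu_cluster))


best = pick_task_for_cluster([1, 4, 7], cpu_cluster=3)
print(best)
```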

dRAP Algorithm (continued)

• Caveat: Task list traversal requires O(nm) time per timestep, where n = number of tasks and m = number of clusters

• For entire simulation: $\sum_{i=0}^{n} (n - i)\,m \approx O(n^2 m)$

• Compare to FIFO scheduling, where the cost drops to O(nm)

• Does our algorithm’s increased complexity per timestep provide enough decrease in scheduling rate to be effective?
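The quadratic bound follows from summing the per-timestep traversal cost. The derivation below assumes that $n - i$ tasks remain in the queue at step $i$ (roughly one task leaving per step), which is our reading of the summand rather than something the slide states:

```latex
\sum_{i=0}^{n} (n - i)\,m
  \;=\; m \sum_{k=0}^{n} k
  \;=\; \frac{m\,n\,(n+1)}{2}
  \;=\; O(n^{2} m)
```

The substitution $k = n - i$ reverses the order of summation, so the total is the familiar triangular number scaled by the number of clusters $m$.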

Simulation

• Example screenshots of implementation, shown on three consecutive slides (lines show clusters, red symbolizes task execution)

Experiments

• Comparisons with a null model (FIFO scheduling algorithm)

• Time to empty queue (of 1000 tasks) = $T_{complete}$

• Average waiting time (averaged over 1000 tasks) = $T_{wait}$

• Values given in simulation time steps:

Algorithm   T_complete   T_wait
RAP         845.60       342.54
FIFO        1071.20      475.31
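To put the reported numbers in relative terms, the short calculation below computes the fractional reduction of each metric against the FIFO baseline. Only the four values come from the slide; the function and variable names are illustrative.

```python
# Values reported on the slide, in simulation time steps.
results = {
    "RAP": {"t_complete": 845.60, "t_wait": 342.54},
    "FIFO": {"t_complete": 1071.20, "t_wait": 475.31},
}


def relative_reduction(baseline, value):
    """Fraction by which `value` improves on `baseline`."""
    return (baseline - value) / baseline


t_complete_gain = relative_reduction(
    results["FIFO"]["t_complete"], results["RAP"]["t_complete"]
)
t_wait_gain = relative_reduction(
    results["FIFO"]["t_wait"], results["RAP"]["t_wait"]
)
print(f"T_complete reduced by {t_complete_gain:.1%}")  # about 21.1%
print(f"T_wait reduced by {t_wait_gain:.1%}")          # about 27.9%
```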

Experiments (continued)

• Utilization experiments

• We compared the cluster utilization ability of our algorithm vs. the FIFO scheduling algorithm

• Calculation for each task, averaged over the total number of tasks: $Nodes_{req} / Nodes_{cluster}$

• Optimal value is 100% (our algorithm always achieves this):

Algorithm   Utilization
RAP         100%
FIFO        56%
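The utilization metric can be sketched as the per-task ratio of required nodes to nodes in the serving cluster, averaged over all tasks. The sample assignments below are invented for illustration and are not the experimental data; the function name is hypothetical.

```python
def mean_utilization(assignments):
    """Average of nodes_req / nodes_cluster over all tasks.

    `assignments` is a list of (nodes_req, nodes_cluster) pairs.
    """
    ratios = [req / cluster for req, cluster in assignments]
    return sum(ratios) / len(ratios)


# Clusters sized exactly to each task: utilization is 100%.
exact_fit = [(2, 2), (4, 4), (3, 3)]
# Clusters larger than needed leave nodes idle and lower the average.
oversized = [(2, 4), (4, 4), (3, 6)]

print(mean_utilization(exact_fit))
print(mean_utilization(oversized))
```

A scheduler that only starts a task once the cluster capacity equals the requirement reaches 100% by construction, which matches the stopping condition used when clusters are formed.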

Experiments (continued)

• Lastly we looked at how the average waiting time and time to completion scaled with the number of nodes in the system

[Figure: two panels, "Scaling of Tcomplete" and "Scaling of Twait"; x-axis: Nodes (0 to 600); y-axes: Tcomplete (0 to 2000) and Twait (0 to 800)]

Experiments (continued)

• Same data using log2 on axes and a power curve fit:

[Figure: two log-log panels, "Scaling of Tcomplete" and "Scaling of Twait"; x-axis: Nodes (log2)]

• Scaling of $T_{complete}$: $y = 63630\,x^{-0.927}$, $R^2 = 0.9976$

• Scaling of $T_{wait}$: $y = 47010\,x^{-1.075}$, $R^2 = 0.9992$
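A power curve y = a·x^b becomes a straight line in log-log space, so it can be fitted with ordinary least squares on the logarithms. The sketch below shows that technique on synthetic points generated from the reported Tcomplete fit; the node counts are the axis tick values, not the measured data, and the function name is hypothetical. An exponent close to -1 means that doubling the number of nodes roughly halves the time.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via linear regression on logs."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept gives the prefactor
    return a, b


# Synthetic data from the reported fit (NOT the experimental measurements).
nodes = [40, 80, 160, 320, 640]
times = [63630 * x ** -0.927 for x in nodes]

a, b = fit_power_law(nodes, times)
print(f"a = {a:.0f}, b = {b:.3f}")
```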

Optimizations Inspired by the Natural Immune System

• Operates under constraints of physical space

• Resource constrained (metabolic input, number of immune system cells)

• Performance scalability is an important concern (mice to horses) (Banerjee and Moses, 2010, in review)

Search Problem

• Immune system cells have to search throughout the whole body to locate small quantities of pathogens

Response Problem

• Immune system cells have to respond by producing antibodies

Nearly Scale-Invariant Search and Response

• How does the immune system search and respond in almost the same time irrespective of the size of the search space?

Solution?

Crivellato et al. 2004

Lymph Nodes (LN)

• A place in which immune system (IS) cells and the pathogen can encounter each other in a small volume

• Form a decentralized detection network

Decentralized Detection Network

www.lymphadvice.com

Lymph Node Dynamics

Summary

• There are increasing costs to global communication as organisms grow bigger

• A semi-modular architecture balances the opposing goals of detecting pathogens (local communication) and recruiting IS cells (global communication)

• Can we emulate this modular RADAR strategy in distributed systems?

Optimizations inspired by the immune system

Conclusions

• The move towards distributed computing necessitates efficient scheduling algorithms

• Decentralized scheduling of a large number of nodes provides robustness, reduces the load on a centralized monitor and responds better to fluctuations in task queue requirements

• Multi-agent systems have emergent properties and have been used here to adaptively create and allocate clusters to match task demand

• The algorithm outperforms our null model (FIFO scheduling) on average waiting time, time to empty the task queue and utilization

• Further, our algorithm is robust to adversarial attack (fluctuations in the processor requirements of tasks in the queue)

Conclusions (continued)

• Value of immune system inspired approaches

• General theory of scaling of artificial immune systems

Future Work

• Compare with more null algorithms

• Compare with algorithms used in industry, e.g., SLURM, which uses static allocation of nodes to clusters known as partitions

• Compare with the cluster allocation algorithm used by Google in MapReduce (this algorithm can improve on their locality optimization since it seeks to form clusters with its neighbors)

• … and sell to the highest bidder!

Acknowledgements

• Dr. Dorian Arnold