
Robust Asynchronous Optimization for Volunteer Computing Grids

Travis Desell, Malik Magdon-Ismail, Boleslaw Szymanski, Carlos Varela, Heidi Newberg, Nathan Cole

Department of Computer Science
Department of Physics, Applied Physics and Astronomy
Rensselaer Polytechnic Institute

E-Science 2009, December 12, Oxford, UK

Overview

• Introduction: Motivation, Driving Scientific Application
• Asynchronous Genetic Search: Why asynchronous? Methodology, Recombination, Particle Swarm Optimization
• Generic Optimization Framework: Approach, Architecture
• Results: Convergence Rates, Re-computation Rates
• Conclusions & Future Work
• Questions?

Motivation

Scientists need easily accessible distributed optimization tools.

Distribution is essential for scientific computing:
• Scientific models are becoming increasingly complex.
• Rates of data acquisition far exceed increases in computing power.

Traditional optimization strategies are not well suited to large-scale computing:
• They lack scalability and fault tolerance.

Astro-Informatics

What is the structure and origin of the Milky Way galaxy?

• Observing from inside the Milky Way provides 3D data: the Sloan Digital Sky Survey has collected over 10 TB of data.
• We can determine its structure – not possible for other galaxies.
• Very expensive – evaluating a single model of the Milky Way with a single set of parameters can take hours or days on a typical high-end computer.
• Models determine where different star streams are in the Milky Way, which helps us better understand its structure and how it was formed.

[Figure: Computed paths of the Sagittarius stream]

Generic Optimization Framework

Separation of concerns:
• Distributed Computing
• Optimization
• Scientific Modeling

“Plug-and-Play”:
• Simple & generic interfaces

Two Distribution Strategies

Asynchronous evaluations (Grids & Internet):
• Results may not be reported, or reported late
• No processor dependencies
• Faults can be ignored

Single parallel evaluation (Supercomputers & Grids):
• Always uses the most evolved population
• Can use traditional methods
• Faults require recalculation
• Grids require load balancing

Asynchronous Architecture

[Diagram: Search routines (evolutionary methods: genetic search, particle swarm optimisation, …) exchange work requests and results with a distributed evaluation framework, which creates evaluators (1..N) over BOINC (Internet) or SALSA/Java (RPI Grid). Scientific models supply data initialisation, the integral function and composition, and the likelihood function and composition; the search takes initial parameters and produces optimised parameters.]

GMLE Architecture (Parallel-Asynchronous)

[Diagram: Search routines communicate with workers (1..Z) over a communication layer (BOINC – HTTP, Grid – TCP/IP, Supercomputer – MPI). Each worker distributes parameters to its evaluators (1..N or 1..M) via MPI and combines their results before reporting back.]

Issues With Traditional Optimization

Traditional global optimization techniques are evolutionary, but iterative and dependent on previous steps:
• The current population is used to generate the next population.

Dependencies and iterations limit scalability and impact performance:
• With volatile hosts, what if an individual in the next generation is lost?
• Redundancy is expensive.
• Scalability is limited by population size.

Asynchronous Optimization Strategy

Use an asynchronous methodology:
• No dependencies on unknown results
• No iterations

Continuously updated population:
• N individuals are generated randomly for the initial population.
• Work requests are fulfilled by applying recombination operators to the population.
• The population is updated with reported results.

Asynchronous Search Strategy

[Diagram: The population holds parameter sets (1..n) with their fitnesses. Members generated from the population fill a work queue of unevaluated parameter sets (1..m); new members are generated when the queue is low. Workers request work, receive unevaluated parameter sets, and report results, which update the population.]

Asynchronous Genetic Search Operators (1)

Average:
• A simple operator for continuous problems.
• The generated parameters are the average of two randomly selected parents.

Mutation:
• Takes a parent and generates a mutation by randomly selecting a parameter and mutating it.
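The two operators above can be sketched directly. Function names are illustrative, and the mutation is assumed to redraw the chosen parameter uniformly from its allowed range (the slide does not specify the mutation distribution).

```python
import random

def average(parent_a, parent_b):
    """Child is the element-wise average of two randomly selected parents."""
    return [(a + b) / 2.0 for a, b in zip(parent_a, parent_b)]

def mutate(parent, bounds):
    """Randomly select one parameter and mutate it, here by redrawing it
    uniformly from that parameter's (lo, hi) range."""
    child = list(parent)
    i = random.randrange(len(child))
    lo, hi = bounds[i]
    child[i] = random.uniform(lo, hi)
    return child
```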

Asynchronous Genetic Search Operators (2)

Double Shot – two parents generate three children:
• The average of the parents.
• A point outside the less fit parent, equidistant from that parent and the average.
• A point outside the more fit parent, equidistant from that parent and the average.
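Geometrically, the two outer children are reflections of the average through each parent, which gives a direct sketch (the function name is illustrative):

```python
def double_shot(less_fit, more_fit):
    """Two parents generate three children: their average, plus one point
    outside each parent, as far from that parent as the parent is from
    the average (i.e. the average reflected through the parent)."""
    avg = [(a + b) / 2.0 for a, b in zip(less_fit, more_fit)]
    outside_less = [2 * p - m for p, m in zip(less_fit, avg)]
    outside_more = [2 * p - m for p, m in zip(more_fit, avg)]
    return avg, outside_less, outside_more
```

For parents 0 and 2 in one dimension, the three children land at 1 (average), -1 (outside the less fit parent), and 3 (outside the more fit parent).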

Asynchronous Genetic Search Operators (3)

Probabilistic Simplex:
• N parents generate one or more children.
• Points are placed randomly along the line through the worst parent and the centroid (average) of the remaining parents.
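A sketch of one child generation, assuming fitness is maximized. The sampling range `limits` for the point on the line is an assumption of this sketch; the slide does not fix it.

```python
import random

def probabilistic_simplex(parents, fitnesses, limits=(-1.5, 1.5)):
    """Place a child randomly on the line through the worst parent and
    the centroid (average) of the remaining parents. At r = 0 the child
    is at the worst parent; at r = 1 it is at the centroid."""
    worst_i = min(range(len(parents)), key=lambda i: fitnesses[i])
    worst = parents[worst_i]
    rest = [p for i, p in enumerate(parents) if i != worst_i]
    centroid = [sum(col) / len(rest) for col in zip(*rest)]
    r = random.uniform(*limits)
    return [w + r * (c - w) for w, c in zip(worst, centroid)]
```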

Particle Swarm Optimization

Particles ‘fly’ around the search space. They move according to their previous velocity and are pulled towards the globally best found position and their locally best found position.

Analogies:
• cognitive intelligence (local best knowledge)
• social intelligence (global best knowledge)

Particle Swarm Optimization

PSO update equations:

vi(t+1) = w * vi(t) + c1 * r1 * (li - pi(t)) + c2 * r2 * (g - pi(t))
pi(t+1) = pi(t) + vi(t+1)

where:
• w, c1, c2 are constants
• r1, r2 are random floats between 0 and 1
• vi(t) is the velocity of particle i at iteration t
• pi(t) is the position of particle i at iteration t
• li is the best position found by particle i
• g is the global best position found by all particles
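The update equations above transcribe directly into code. The default constants below are common PSO choices, not the settings used in the talk.

```python
import random

def pso_step(p, v, local_best, global_best, w=0.7, c1=2.0, c2=2.0):
    """One PSO update for a single particle:
    v(t+1) = w*v(t) + c1*r1*(l - p(t)) + c2*r2*(g - p(t))
    p(t+1) = p(t) + v(t+1)"""
    r1, r2 = random.random(), random.random()
    v_next = [w * vi + c1 * r1 * (li - pi) + c2 * r2 * (gi - pi)
              for vi, pi, li, gi in zip(v, p, local_best, global_best)]
    p_next = [pi + vi for pi, vi in zip(p, v_next)]
    return p_next, v_next
```

Note that when the particle sits exactly at both its local best and the global best, the attraction terms vanish and the particle simply coasts on its damped velocity w * v(t).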

Asynchronous PSO

Generating new positions does not necessarily require the fitness of the previous position:

1. Generate new particle positions to fill the work queue.
2. Update the local and global best on reported results: if a result improves a particle’s local best, update the local best, and set the particle’s position and velocity from the result.
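Step 2 above can be sketched as a result handler. This is an interpretation of the slide, not the authors' code: the particle and swarm records are illustrative dictionaries, and fitness is assumed to be maximized.

```python
def on_result(particle, result_position, result_velocity, result_fitness, swarm):
    """When a reported result improves a particle's local best, the local
    best, the particle's position, and its velocity are all taken from
    the result; the global best is updated if it is also beaten."""
    if result_fitness > particle["local_best_fitness"]:
        particle["local_best"] = list(result_position)
        particle["local_best_fitness"] = result_fitness
        particle["position"] = list(result_position)
        particle["velocity"] = list(result_velocity)
        if result_fitness > swarm["global_best_fitness"]:
            swarm["global_best"] = list(result_position)
            swarm["global_best_fitness"] = result_fitness
```

Results that do not improve a particle's local best are simply discarded, which is why late or missing reports cost nothing.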

Particle Swarm Optimization (Example)

[Diagram: A particle at position pi(t), with previous position pi(t-1) and velocity vi(t). The vectors w * vi(t), c1 * (li - pi(t)), and c2 * (g - pi(t)) span the region of possible new positions relative to the local best and global best.]

Particle Swarm Optimization (Example)

[Diagram: The particle moves to a new position, finding a new local best position and the global best position.]

Particle Swarm Optimization (Example)

[Diagram: Another particle finds the global best position, so the c2 * (g - pi(t)) attraction term and the region of possible new positions change accordingly.]

Asynchronous PSO

[Diagram: The population holds individuals (1..n) with their fitnesses. Individuals are selected in a round-robin manner to generate new individuals, which fill a queue of unevaluated individuals when the queue is low. Workers (fitness evaluation) request work, receive individuals, and report results; the local and global best are updated if a new individual has better fitness.]

Computing Environment: MilkyWay@Home

• http://milkyway.cs.rpi.edu
• Built on BOINC, like Einstein@home, SETI@home, etc.
• Over 50,000 users, 80,000 CPUs, and 600 teams from 99 countries
• Second largest BOINC computation (among hundreds), at about 500 teraflops
• Volunteers donate their idle computer time to help perform our calculations.

MilkyWay@Home – Growth of Power

[Chart: growth of MilkyWay@Home computing power over time]

Computing Environments – BOINC

• MilkyWay@Home: http://milkyway.cs.rpi.edu/

Multiple asynchronous workers:
• Approximately 10,000 – 30,000 volunteered computers engaged at a time
• Asynchronous architecture used

Asynchronous evaluation:
• Volunteered computers can queue up to 20 pending individuals
• The population is updated when results are reported
• Individuals may be reported slowly or not at all


Handling of Work Units by the BOINC Server

User Participation

Users do more than volunteer computing resources (citizen science):
• Open-source code gives users access to the MilkyWay@Home application.
• Users have submitted many bug reports, fixes, and performance enhancements.
• A user even created an ATI GPU capable version of the MilkyWay@Home application.
• Forums provide opportunities for users to learn about astronomy and computer science.

Malicious/Incorrect Result Verification

With open-source application code, users can compile their own compiler-optimized versions, and many do. However, there is also the possibility of users returning malicious results.

BOINC traditionally uses redundancy on every result to verify correctness; this requires at least two results for every work unit!

Asynchronous search does not require all work units to be verified – only those which improve the population. We reduce redundancy by comparing a result against the current partial results.

Limiting Redundancy (Genetic Search)

• 60% verification found the best solutions.
• Increased verification reduces reliability.
• Reliability and convergence by number of parents seem dependent on the verification rate.

Limiting Redundancy (PSO)

• 30% verification found the best solutions.
• Increased verification reduces reliability, though not as dramatically as for AGS.
• Lower inertia weights give better results.

Optimization Method Comparison

• APSO found better solutions than AGS.
• APSO needed lower verification rates and was less affected by different verification rates.

Conclusions

Asynchronous search is effective in large-scale computing environments:
• Fault tolerant without expensive redundancy.
• Asynchronous evaluation on a heterogeneous environment increases diversity.
• BOINC converges almost as fast as the BlueGene, while offering more availability and computational power.
• Even computers with slow result report rates are useful.

Particle swarm and simplex-genetic hybrid methods provide significant improvement in convergence.

Future Work

Optimization:
• Use report times to determine how to generate individuals.
• Simulate asynchrony for benchmarks.
• Automate selection of parameters.

Distributed Computing:
• Parallel asynchronous workers.
• Handle malicious “volunteers”.

Continued collaboration.

http://www.nasa.gov


Questions?