A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System



Page 1: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System

04/19/23 CS253 Fall 05 by Dr Vana Kalogeraki

A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System

Class project by Piyush Ranjan Satapathy & Van Lepham

Page 2: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Problem Addressed

What happens when scheduled jobs in a real distributed system fail due to fault-causing behaviors?

Can I schedule my jobs on different types of resources?

Is there a truly generic tool for dynamic distributed scheduling of jobs?

Generic in the sense of supporting any kind of resource, any kind of job, a suitable choice of algorithm, and fault tolerance?

Jobs range from running regression test cases to complex scientific calculations: anything that can run in parallel

Page 3: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Motivation

Can we have a system that executes our distributed jobs dynamically or statically regardless of faulty behavior, while minimizing both the number of resources used and the total completion time?

Can we bring it up to the standards of both academia and industry?

Page 4: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Our Contribution

A system built from scratch

Monitoring and feedback mechanism for real-life parallel job execution

Implemented and Evaluated 5 Algorithms

A Java GUI for better user interaction

Page 5: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Outline

Introduction

Tools Visited (Related Work)

Our Central Idea (Architecture)

Implementation

Evaluation

Conclusion

Next Step

Page 6: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Introduction

No existing tool, academic or industrial, fits a wide range of environments

Our objective is a tool that is easy to use, port, and enhance

We monitor the running jobs and the scheduled machines, collect information, and use it for scheduling analysis

Our initial evaluation shows up to 10-15% better performance on a pool of 40 machines, 11 of which are faulty

Page 7: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Tools Visited

OpenSTARS (2005)

A flexible, real-time, optimized tool for scheduling jobs in a distributed system

But it does not handle fault-tolerant behaviors

Source: http://rtdev.cs.uri.edu/svn/repos/trunk/

SPHINX (2005)

A fault-tolerant system for scheduling in a dynamic Grid environment

Source: http://sphinx.phys.ufl.edu/

Cheddar (2004)

Free real-time scheduling tool based on EDF and LLF

Source: http://beru.univ-brest.fr/~singhoff/cheddar/

VEST (2003)

Real-time schedulability analysis for software-to-hardware allocations

Source: http://www.cs.virginia.edu/~ty4k/vestpage/
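EDF (Earliest Deadline First) and LLF (Least Laxity First) come up again in the experimental setup, so a short reminder of how each one picks the next job may help. This is a minimal sketch of the textbook selection rules only; the `Job` fields and the example values are assumptions made for illustration and are not taken from Cheddar or from the project.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    deadline: float    # absolute deadline in seconds (illustrative field)
    remaining: float   # remaining execution time in seconds (illustrative field)


def pick_edf(jobs):
    """Earliest Deadline First: choose the job whose deadline comes soonest."""
    return min(jobs, key=lambda j: j.deadline)


def laxity(job, now):
    """Slack the job would still have if it started running right now."""
    return job.deadline - now - job.remaining


def pick_llf(jobs, now):
    """Least Laxity First: choose the job with the least slack."""
    return min(jobs, key=lambda j: laxity(j, now))


jobs = [
    Job("short", deadline=10.0, remaining=2.0),   # laxity at t=0 is 8
    Job("long", deadline=12.0, remaining=11.0),   # laxity at t=0 is 1
]
edf_choice = pick_edf(jobs).name            # earliest deadline wins
llf_choice = pick_llf(jobs, now=0.0).name   # least slack wins
```

The two rules disagree on this input: EDF prefers the job with the nearer deadline, while LLF prefers the job that can least afford to wait.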

Page 8: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Tools Visited

STAF (2001)

Software Test Automation Framework (IBM) that runs across a number of machines. Good for nightly software builds

Source: http://staf.sourceforge.net/index.php

TimeWiz (2000)

A comprehensive tool for real-time modeling and analysis

Source: http://www.timesys.com/products/timewiz/

RapidRMA (1993)

Based on EDF and a CORBA-based real-time system

Source: http://www.tripac.com/html/downloads.html

Page 9: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Core Architecture

[Architecture diagram. Labels: User scheduling job (sets algorithm); Middleware: Analyzer, Supervisor & Keeping History, Resource Monitor, Job Monitor, History Storage; Wide Variety of Resources: Native Machine, List of Machines, Grid Clusters, LSF]

3 Layers:

1. Adaptive Analyzer

2. Monitoring and Feedback

3. Information Storage
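One way to picture how the three layers could cooperate is the sketch below. It is not the project's code: the class names, the method names, and the "fewest failures first" rule are all assumptions chosen to keep the example small.

```python
from collections import defaultdict


class HistoryStore:
    """Information storage layer: per-machine outcome counts (hypothetical)."""

    def __init__(self):
        self.failures = defaultdict(int)
        self.successes = defaultdict(int)

    def record(self, machine, ok):
        if ok:
            self.successes[machine] += 1
        else:
            self.failures[machine] += 1


class Monitor:
    """Monitoring and feedback layer: forwards job outcomes to the store."""

    def __init__(self, store):
        self.store = store

    def report(self, machine, ok):
        self.store.record(machine, ok)


class Analyzer:
    """Adaptive analyzer layer: prefers the machine with the fewest failures."""

    def __init__(self, store, machines):
        self.store = store
        self.machines = list(machines)

    def pick(self):
        # min() returns the first minimal item, so ties go to the earliest machine.
        return min(self.machines, key=lambda m: self.store.failures[m])


store = HistoryStore()
monitor = Monitor(store)
analyzer = Analyzer(store, ["m1", "m2", "m3"])

first = analyzer.pick()          # no history yet
monitor.report("m1", ok=False)   # feedback: the job on m1 failed
second = analyzer.pick()         # the analyzer now steers away from m1
```

The point of the split is that the analyzer never talks to machines directly; it only reads what the monitoring layer has written into storage.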

Page 10: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Implementation

An individual job is represented as a script

The job list contains all jobs to be executed

The machine list contains the names of machines or the names of grid sites

Login information (submit command, status command, kill command) is supplied for each Grid, LSF, or remote machine

Either list can be static or dynamic

Monitoring and feedback can be turned on and off
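The inputs listed above could be grouped into a single configuration object, roughly as follows. This is only a sketch: the class and field names, the `{job}` and `{job_id}` placeholders, and the script names are hypothetical. `bsub`, `bjobs`, and `bkill` are the standard LSF commands for submitting, querying, and killing jobs, used here purely as an example of the three command slots.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceAccess:
    """Command templates for one kind of resource (names are illustrative)."""
    submit_command: str
    status_command: str
    kill_command: str


@dataclass
class SchedulerConfig:
    job_list: List[str]        # each job is a script
    machine_list: List[str]    # machine names or grid site names
    access: ResourceAccess
    dynamic: bool = True                  # static or dynamic mode
    monitoring_and_feedback: bool = True  # can be switched off

    def submit_line(self, job):
        """Fill the submit template for one job script."""
        return self.access.submit_command.format(job=job)


lsf_access = ResourceAccess(
    submit_command="bsub {job}",
    status_command="bjobs {job_id}",
    kill_command="bkill {job_id}",
)
config = SchedulerConfig(
    job_list=["regress_01.sh", "regress_02.sh"],  # hypothetical script names
    machine_list=["lsf-queue"],                   # hypothetical resource name
    access=lsf_access,
)
submit_lines = [config.submit_line(job) for job in config.job_list]
```

Keeping the three commands as templates is what would let the same scheduler drive a grid site, an LSF cluster, or a plain remote machine without code changes.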

Page 11: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Implementation (GUI)

Page 12: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Experimental Setup

40 machines inside the EBII Building

Fault-causing behaviors:

5 machines are not connected at all

6 machines connect and hang

60 jobs; execution time ranges from 2 seconds to 5 minutes

5 algorithms considered:

Round Robin (working)

#CPU Based (working)

Job Completion Based (working)

EDF (yet to work)

LLF (yet to work)
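To make the role of feedback concrete, here is a toy Round Robin simulation on a pool shaped like the setup above (40 machines, 11 faulty, 60 jobs). It is a sketch under strong assumptions: the machine names, the choice of which machines are faulty, and the rule that any dispatch to a faulty machine fails at once and triggers one rescheduling are all invented for illustration. The counts it produces come from this toy model and are not the measurements reported in the evaluation.

```python
def round_robin(num_jobs, machines, faulty, feedback):
    """Dispatch jobs in round-robin order and return the number of reschedulings.

    With feedback on, a machine that fails once is dropped from the rotation.
    Assumes at least one machine in the pool is healthy.
    """
    pool = list(machines)
    reschedules = 0
    pos = 0
    for _ in range(num_jobs):
        while True:
            pos %= len(pool)
            machine = pool[pos]
            if machine not in faulty:
                pos += 1          # job placed; move on to the next machine
                break
            reschedules += 1      # dispatch failed; the job must be rescheduled
            if feedback:
                pool.pop(pos)     # drop the faulty machine; pos now points at the next one
            else:
                pos += 1          # no memory of the failure; just try the next machine


    return reschedules


machines = [f"m{i}" for i in range(40)]
faulty = set(machines[:11])   # 5 unreachable + 6 hanging, treated alike here

without_feedback = round_robin(60, machines, faulty, feedback=False)
with_feedback = round_robin(60, machines, faulty, feedback=True)
```

Without feedback the scheduler keeps rediscovering the same faulty machines on every pass through the list, while with feedback each faulty machine costs at most one rescheduling.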

Page 13: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Evaluation

Monitoring and Feedback Analysis

[Chart: average number of reschedulings (y-axis, 0 to 14) versus number of jobs executed (x-axis, 10 to 60); series: #CPU Based, Round Robin]

Performance Analysis

[Chart: time in seconds (y-axis, 0 to 250) by scheduling algorithm (Round Robin, #CPU Based, Job Completion Based); series: No Monitor No Feedback, Monitor and Feedback]

Page 14: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Conclusion

Designed a dynamic distributed scheduling system from top to bottom

Implemented fault tolerance through monitoring and feedback

Stored the job type and the history of each job once it has executed, to make the analyzer's job easier

Obtained interesting initial results from small experiments
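The idea of storing job type and job history for the analyzer can be sketched as a small lookup keyed by job type. Everything here is an assumption made for illustration: the class name, the use of a mean runtime as the estimate, and the sample values are not from the project.

```python
from collections import defaultdict
from statistics import mean


class JobHistory:
    """Remembers observed runtimes per job type (hypothetical structure)."""

    def __init__(self):
        self._runtimes = defaultdict(list)

    def record(self, job_type, seconds):
        """Store the runtime of a job that has finished executing."""
        self._runtimes[job_type].append(seconds)

    def estimate(self, job_type, default=None):
        """Mean runtime seen so far for this job type, or default if unseen."""
        runs = self._runtimes.get(job_type)
        return mean(runs) if runs else default


history = JobHistory()
history.record("regression", 2.0)
history.record("regression", 4.0)

known = history.estimate("regression")                  # average of past runs
unknown = history.estimate("simulation", default=60.0)  # no history: fall back
```

An analyzer with such estimates could, for example, send the jobs expected to run longest to the machines with the cleanest record.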

Page 15: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


What’s Next?

We could not get a login ID on either PlanetLab or Grid3

Experiment on a real grid

Implement more algorithms

Run on real benchmarks

Page 16: A Generic Fault Tolerant System for Dynamic Scheduling in Distributed System


Thank You